Laying the foundation for a future Ceph cluster: does it fit the use case?

Side2005

New Member
Dec 11, 2020
Hey all,

long time lurker, first time poster here :)
Following problem:
  • It is required to have up to 100 users access a central service
  • The service consists of a user application running on Windows Server 2019 and a Linux-backed database (Debian 10 and FirebirdDB)
  • Those 100 users are split across different offices throughout the country (Germany)
  • High availability is required during business hours (Mon-Sat, 8am-6pm). Downtime of <15 minutes is acceptable.
  • The project is still in the design phase. No hardware has been purchased yet (neither servers nor networking equipment)
  • Administrators are off-site and it takes at least half a day to physically reach the servers
  • Only one office has a proper server room. That's why we are thinking about colocating the servers in a datacenter. Does Ceph even make sense in this scenario?
I use Proxmox on my own (homelab) and I'm familiar with Proxmox and the ZFS file system, but I have not worked with Ceph yet.
The plan for the environment is to use a Proxmox HA Ceph cluster with 3x nodes and a Business Subscription.

Specs per node (edit2: different NICs):
1x CSE-116AC2-R706WB2
1x Supermicro H11DSi-NT , 2x 10gbit/rj45 NIC onboard, 2x CPU possible
1x EPYC 7302
2x 64GB RAM 3200MHz, up to 16x DDR4 possible
1x 4TB Intel P4510 U.2 SSD (connected via NVMe) for OSD
2x 480GB Samsung PM883 SSD for Proxmox OS
1x AOC-S25G-m2S 2x25gbit/sfp28
1x Intel i210-T1 1x1gbit/rj45



The plan is to first start with 2 VMs (one on each node) and see how the system behaves. If the service mentioned above (1x Windows server, 1x Linux server) works flawlessly, we might add more VMs to the cluster (for example a bare-metal Exchange server that we want to virtualize).
The case can house up to 2x NVMe U.2 and 8x SATA6 2.5" SSDs. At the initial purchase, 1x NVMe bay will be populated as well as 2x SATA for the Proxmox OS (RAID1). We first want to scale the cluster vertically (by adding a 2nd CPU, more RAM, more SSDs) before we go the route of horizontal scaling (adding more nodes), due to the high network equipment requirements.

Questions regarding hardware:
QH1:
The XL710-QDA2 will be used to directly connect the nodes to each other with DAC cables (peer-to-peer), which will be the cluster network. The 2x 10Gbit onboard NICs will be used for public/Proxmox/user-client networking via a 10G switch. Do I need to add another NIC for a corosync/heartbeat network, or is the current networking sufficient?
Solution: Add a NIC for a dedicated corosync network (the current networking is not sufficient)

QH2:
If we want to expand the fast NVMe storage, we would have to purchase 3x 4TB drives, put one into each node, configure each 4TB NVMe as an OSD and add it to the ceph_nvme pool. This will double the usable space and usable IO performance (if the network bandwidth allows it), correct?
Solution: Yes, but not linearly

QH3:
Point:
  • Administrators are off-site and it takes at least half a day to physically reach the servers
Should we plan with 4 nodes to have "room" for failover? OR is a 3-node setup with decent server hardware resilient enough? Remember it is not required to have 24x7 availability.

Questions regarding project:
QP1:
Will scaling vertically work in this project? My biggest concern is the storage. The database and the user application are not going to use a lot of space (less than 1TB). IO performance is way more important; that's why we want both initial VMs to be on fast NVMe storage.
Later on we want to populate the empty 6x SATA 2.5" bays with cheaper SATA SSDs (and maintain 2 different pools, ceph_nvme and ceph_ssd) for future VMs that do not require fast NVMe performance.

QP2:
Each node will process VMs and handle data storage when first built. Later on, if the project is successful and the need for more space comes up, we want to separate storage and compute nodes.
The "dream" would be to upgrade the initial 3 nodes to maximum compute power, yank out all the then-installed OSDs, purchase 3 additional storage nodes, populate those with the existing OSDs, throw in 3.5" drives and add a 3rd pool, ceph_hdd.
So we would have 3x Proxmox hosts with mon and mgr on each node and a "real" Ceph storage cluster with 3x nodes.
Does this work well with Proxmox? What would be the process of transforming a Proxmox HA Ceph cluster into Proxmox hosts with a Ceph storage backend?

Questions regarding Ceph:
QC1:
The plan is to run a 3/2 rule (3 replicas with 2 minimum copies), which allows taking down one node for maintenance (upgrading software/hardware etc.). Good choice? What are the pitfalls? Anyone got a better idea?

QC2:
I just started learning Ceph. Any good references besides Red Hat's documentation (books etc.)? I'm currently playing with my Proxmox test lab (3x virtualized Proxmox nodes with CephFS installed, 6GB RAM each) and want to test out various scenarios (URE of a node, replacing disks, replacing a node, adding additional nodes etc.).

QC3:
Regarding
  • Only one office has a proper server room. That's why we are thinking about colocating the servers in a datacenter. Does Ceph even make sense in this scenario?
Can someone deliver input on this please?

I'm really grateful for any advice I can soak up.

Hope to find answers soon
Fabius


Edit1: typos
 
The XL710-QDA2 will be used to directly connect the nodes to each other with DAC cables (peer-to-peer), which will be the cluster network. The 2x 10Gbit onboard NICs will be used for public/Proxmox/user-client networking via a 10G switch. Do I need to add another NIC for a corosync/heartbeat network, or is the current networking sufficient?
Corosync is essential for the cluster to work, especially if you want to use HA. Corosync doesn't need a lot of bandwidth, but it really needs low latency. If Corosync shares the physical network with other services, you can run into the situation that these other services congest the network. This means that the latency for Corosync goes up, and if it gets too high, that link will no longer be considered up by Corosync and the node will not be connected to the cluster.

If you have HA enabled and a node loses the connection to the quorate part of the cluster while it has (or had since the last reboot) an HA-enabled guest running, it will fence itself (hard reset) if it cannot reestablish the connection to the cluster fast enough. If the reason for the connection loss is a congested network, it usually affects all nodes, with the result that the nodes cannot form a quorate cluster anymore and all of them will fence themselves.
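To make this concrete, here is a minimal sketch of what a dedicated Corosync link plus an optional fallback link could look like in /etc/pve/corosync.conf (the node name and addresses are made-up examples):

Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.30.30.1   # dedicated 1Gbit NIC, Corosync traffic only
    ring1_addr: 10.10.10.1   # optional fallback link on the public network
  }
  # ... one entry per node
}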

QH2:
If we want to expand the fast NVMe storage, we would have to purchase 3x 4TB drives, put one into each node, configure each 4TB NVMe as an OSD and add it to the ceph_nvme pool. This will double the usable space and usable IO performance (if the network bandwidth allows it), correct?
If the other components (network, RAM, CPU) have enough resources for it, then yes, you will have double the space and better IO performance, though the performance will not scale perfectly to 2x of the previous setup.
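For reference, adding such a drive as an OSD is a one-liner with pveceph on each node (the device name is just an example):

Code:
# on each node, after installing the new 4TB NVMe
pveceph osd create /dev/nvme1n1

# watch the rebalance and the new raw/usable capacity
ceph -s
ceph osd df tree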

QH3:
Point:
  • Administrators are off-site and it takes at least half a day to physically reach the servers
Should we plan with 4 nodes to have "room" for failover? OR is a 3-node setup with decent server hardware resilient enough? Remember it is not required to have 24x7 availability.
It depends on your failure domains and what you are willing to spend. If you go into separate rooms or fire sections, the setup quickly gets very complicated, as you have to ensure that whichever section stays alive holds the majority (Corosync, Ceph MONs). Therefore, you need a third section which provides the vote needed to reach that majority.
An easier approach, considering that this failure scenario is less likely to happen, is to have a separate standby cluster to which you replicate the VM configs and disks on a regular basis.

QP1:
Will scaling vertically work in this project? My biggest concern is the storage. The database and the user application are not going to use a lot of space (less than 1TB). IO performance is way more important; that's why we want both initial VMs to be on fast NVMe storage.
Later on we want to populate the empty 6x SATA 2.5" bays with cheaper SATA SSDs (and maintain 2 different pools, ceph_nvme and ceph_ssd) for future VMs that do not require fast NVMe performance.
If IO is important you have to consider the full IO stack which for Ceph is roughly the following:

Code:
Application -> guest OS -> Virtualization -> Network -> primary OSD -> network -> secondary OSD
                                                                    |
                                                                    -> network -> tertiary OSD

Therefore, to reduce the latency on the network you should consider using 25G NICs instead of 40G. The reason is that 40G networks are usually done with QSFP+, which aggregates 4x 10G. So you have 4 times the bandwidth but still the latency of a 10G network. Because 25G runs on a single SFP28 lane, the latency is quite a bit lower.

QP2:
Each node will process VMs and handle data storage when first built. Later on, if the project is successful and the need for more space comes up, we want to separate storage and compute nodes.
The "dream" would be to upgrade the initial 3 nodes to maximum compute power, yank out all the then-installed OSDs, purchase 3 additional storage nodes, populate those with the existing OSDs, throw in 3.5" drives and add a 3rd pool, ceph_hdd.
So we would have 3x Proxmox hosts with mon and mgr on each node and a "real" Ceph storage cluster with 3x nodes.
Does this work well with Proxmox? What would be the process of transforming a Proxmox HA Ceph cluster into Proxmox hosts with a Ceph storage backend?
This is possible. To separate the clusters cleanly, I recommend having two separate clusters: one to manage Ceph and the other for the actual VMs. You can then configure the compute cluster to connect to the Ceph cluster (external Ceph). This way you will not accidentally migrate a VM to the storage cluster.

The process could look like this: Add the storage cluster. On the compute cluster configure the external pools from the storage cluster. Move the VM disks to the new pools.
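As a rough sketch, the external Ceph storage entry in /etc/pve/storage.cfg on the compute cluster could look something like this (monitor addresses, pool and storage names are examples; the matching keyring goes under /etc/pve/priv/ceph/):

Code:
rbd: ceph-nvme
        content images,rootdir
        krbd 0
        monhost 10.10.10.11 10.10.10.12 10.10.10.13
        pool ceph_nvme
        username admin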

If you want to reuse the SSDs, this will be a bit more complicated, because you can only remove them from the old (now compute-only) cluster once no VM disks are stored on them anymore.

QC1:
The plan is to run a 3/2 rule (3 replicas with 2 minimum copies), which allows taking down one node for maintenance (upgrading software/hardware etc.). Good choice? What are the pitfalls? Anyone got a better idea?
This is the default and works well. Ceph by default waits about 10 minutes until it decides that an OSD is not part of the cluster anymore, so normal reboots after updates are no problem. If OSDs are down for longer, Ceph will try to recreate the third replica on other nodes. This obviously only works if there are more than 3 nodes in the cluster. If you fear that you will lose more than 1 node at a time, you will have to increase the number of nodes (by 2, as an odd number of nodes is good to avoid split-brain situations) and then you can increase the size/min_size accordingly.
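If you want to check or adjust these values on the CLI, something along these lines should work (the pool name is an example):

Code:
# replica settings of a pool
ceph osd pool get ceph_nvme size        # 3 by default
ceph osd pool get ceph_nvme min_size    # 2 by default

# grace period before a down OSD is marked "out" and recovery starts
ceph config get mon mon_osd_down_out_interval   # 600 seconds (~10 minutes)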

It's always a balance of how failure safe you want to make it and how much money you are willing / able to spend ;)

QC2:
I just started learning Ceph. Any good references besides Red Hat's documentation (books etc.)? I'm currently playing with my Proxmox test lab (3x virtualized Proxmox nodes with CephFS installed, 6GB RAM each) and want to test out various scenarios (URE of a node, replacing disks, replacing a node, adding additional nodes etc.).
The official Ceph docs [0] are good for getting to know the Ceph internals. Then of course there is the Ceph chapter in our documentation [1], and especially the pveceph tool (if you like the CLI), which helps with many tasks that would otherwise involve a lot of manual steps.
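To give an impression of the pveceph workflow, setting up Ceph on a fresh node looks roughly like this (the network and device are examples):

Code:
pveceph install                          # install the Ceph packages
pveceph init --network 10.20.20.0/24     # Ceph public network (a separate --cluster-network is possible too)
pveceph mon create
pveceph mgr create
pveceph osd create /dev/nvme0n1
pveceph pool create ceph_nvme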

QC3:
Regarding
  • Only one office has a proper server room. That's why we are thinking about colocating the servers in a datacenter. Does Ceph even make sense in this scenario?
If you have full control of the network, so that you can provide enough physical links to separate the networks, why not.
Should you not have seen it yet, we have a wiki page describing a full mesh network [2] for Ceph. This can be useful if you only need a 3-node cluster which you don't plan to expand in the near future, as you don't have to spend money on fast switches; plus, without switches the latency will be a bit lower.
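As a very rough sketch of the routed full mesh variant (interface names and addresses are placeholders, please follow the wiki page [2] for the exact setup), each node carries one cluster IP and a static /32 route to each peer over the respective direct link:

Code:
# /etc/network/interfaces excerpt on node 1
auto ens19
iface ens19 inet static
        address 10.15.15.50/24
        up ip route add 10.15.15.51/32 dev ens19
        down ip route del 10.15.15.51/32

auto ens20
iface ens20 inet static
        address 10.15.15.50/24
        up ip route add 10.15.15.52/32 dev ens20
        down ip route del 10.15.15.52/32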


May I give you another idea of what could be possible? If it is just for the initial use case, a DB and an application VM, you could approach this differently.

You could create a 2-node cluster with ZFS storage + a QDevice [3] (to still have 2 out of 3 votes if a node fails). For the DB, have a VM on each node, use replication on the DB level to keep them in sync, and use a floating IP for client access.
The application server most likely cannot be replicated on the application layer. Therefore, you can set up disk replication (needs ZFS) between the 2 nodes and configure the application VM as an HA VM. If the node on which the application VM is located dies, it will be started after about 3 minutes on the other node.
Depending on the disk replication interval there might be some data loss. The shortest interval is 1 minute. Not knowing the application that will be running on it, I assume the best case, which is that all the important data is in the DB ;)
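A rough sketch of how that could be wired up on the CLI (IP address, node name, VM ID and schedule are examples; the external QDevice host needs the corosync-qnetd package):

Code:
# on both cluster nodes
apt install corosync-qdevice
pvecm qdevice setup 10.10.10.5           # IP of the external QDevice host

# replicate the application VM (ID 100) to the second node every minute
pvesr create-local-job 100-0 pve2 --schedule "*/1"

# let the HA stack manage the VM so it gets restarted on the surviving node
ha-manager add vm:100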


Lastly, a Proxmox VE training might be of interest to you. Check out the contents and possible dates offered by our training partners and ourselves on our website [4].


[0] https://docs.ceph.com
[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pveceph
[2] https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
[3] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_corosync_external_vote_support
[4] https://www.proxmox.com/en/training
 
Corosync is essential for the cluster to work, especially if you want to use HA. Corosync doesn't need a lot of bandwidth, but it really needs low latency. If Corosync shares the physical network with other services, you can run into the situation that these other services congest the network. This means that the latency for Corosync goes up, and if it gets too high, that link will no longer be considered up by Corosync and the node will not be connected to the cluster.

If you have HA enabled and a node loses the connection to the quorate part of the cluster while it has (or had since the last reboot) an HA-enabled guest running, it will fence itself (hard reset) if it cannot reestablish the connection to the cluster fast enough. If the reason for the connection loss is a congested network, it usually affects all nodes, with the result that the nodes cannot form a quorate cluster anymore and all of them will fence themselves.
Therefore, to reduce the latency on the network you should consider using 25G NICs instead of 40G. The reason is that 40G networks are usually done with QSFP+, which aggregates 4x 10G. So you have 4 times the bandwidth but still the latency of a 10G network. Because 25G runs on a single SFP28 lane, the latency is quite a bit lower.

I removed the 2x 40Gbit QSFP+ NIC and replaced it with a Supermicro AOC-S25G-m2S (2x 25Gbit/SFP28) and added an Intel i210-T1 (1x 1Gbit) for Corosync. So per node I now have 2x 25Gbit/SFP28 (DAC, cluster), 2x 10Gbit/RJ45 (public) and 1x 1Gbit/RJ45 (Corosync).
After checking the PVE documentation (thanks for the links), there is no need for corosync link redundancy since it will use all available networks with lower priority, correct?




Code:
Application -> guest OS -> Virtualization -> Network -> primary OSD -> network -> secondary OSD
                                                                    |
                                                                    -> network -> tertiary OSD
This is the model for the full Ceph stack, right?
If VMs reside on a node with a local OSD, there is usually no networking between the virtualization host (PVE) and the primary OSD, because for a read all the necessary data resides on the node the VM lives on. Only the secondary/tertiary OSDs need to be contacted via the network, in case not all the necessary data lives on the primary OSD. Or do I have a wrong understanding of how the data is stored in PGs and how they are distributed across the cluster? I think I need to dig deeper into the Ceph documentation before I go any further.
So this is important if we want to transform into an "actual" Ceph cluster later on, but not for now?


Should you not have seen it yet, we have a wiki page describing a full mesh network [2] for Ceph.
That's the current plan (3 nodes with a meshed cluster network). And we will definitely go for a server housing/colocation solution (makes me sleep better :D )


You could create a 2-node cluster with ZFS storage + a QDevice [3] (to still have 2 out of 3 votes if a node fails). For the DB, have a VM on each node, use replication on the DB level to keep them in sync, and use a floating IP for client access.
The application server most likely cannot be replicated on the application layer. Therefore, you can set up disk replication (needs ZFS) between the 2 nodes and configure the application VM as an HA VM. If the node on which the application VM is located dies, it will be started after about 3 minutes on the other node.
Depending on the disk replication interval there might be some data loss. The shortest interval is 1 minute. Not knowing the application that will be running on it, I assume the best case, which is that all the important data is in the DB
I had a similar plan without HA and without a 3rd QDevice:
At my primary job we use Debian + QEMU/KVM on ZFS with sanoid/syncoid for replication and manual stopping/starting of the VMs. I wanted to set this cluster up in a similar manner, but it does require manual intervention by an admin; that's why I moved away from this plan.
After checking the docs, a 2-node + QDevice setup handles HA the same way a Proxmox Ceph HA cluster does (fencing etc.), correct? I might look into it.



I will look into training!


Cheers
Fabius

edit1: typo
edit2: rewording of thoughts on Diagram
edit3: can you maybe tell me more about the 2 different pools? "Later on we do want to populate the empty 6x SATA 2,5" bays with cheaper SATA SSDs (and maintain 2 different pools, ceph_nvme and ceph_ssd) for future vms that do not require fast NVME performance."
 
After checking the PVE documentation (thanks for the links), there is no need for corosync link redundancy since it will use all available networks with lower priority, correct?
Yes, you can configure up to 8 corosync links, but keep in mind that this should be a measure of last resort, and you do not know how congested these networks are. If you can spare another 1Gbit link, I would do it, especially if you want to use HA. Without HA enabled you might run into the situation that you cannot start a VM or change a config, but the currently running guests will keep running. With HA enabled, you risk fencing nodes if the cluster communication is not guaranteed.
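For completeness, the redundant links are configured when creating or joining the cluster, roughly like this (addresses are examples):

Code:
# creating the cluster with two Corosync links
pvecm create mycluster --link0 10.30.30.1 --link1 10.10.10.1

# joining another node, specifying its own addresses for both links
pvecm add 10.30.30.1 --link0 10.30.30.2 --link1 10.10.10.2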

This is the model for the full Ceph stack, right?
If VMs reside on a node with a local OSD, there is usually no networking between the virtualization host (PVE) and the primary OSD, because for a read all the necessary data resides on the node the VM lives on. Only the secondary/tertiary OSDs need to be contacted via the network, in case not all the necessary data lives on the primary OSD. Or do I have a wrong understanding of how the data is stored in PGs and how they are distributed across the cluster? I think I need to dig deeper into the Ceph documentation before I go any further.
So this is important if we want to transform into an "actual" Ceph cluster later on, but not for now?
Ceph does not know any locality. Just think about how it would work in a 4 node cluster and a size/min_size of 3/2 ;)
Even in a 3 node cluster the primary OSD does not necessarily need to be on the current node. On sync writes, the ACK will only be delivered to the guest once the ACK is back from all OSDs. The diagram in the Ceph architecture documentation [0] shows this nicely.

Some PGs will definitely have their primary OSD on the local node in a 3 node cluster, but not all.
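You can check this yourself: for any given object, ceph osd map shows the acting set and which OSD is the primary (pool and object name below are examples):

Code:
ceph osd map ceph_nvme rbd_data.abc123.0000000000000001
# -> ... up ([3,0,7], p3) acting ([3,0,7], p3)   # OSD 3 is the primary for this PG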
After checking the docs, a 2-node + QDevice setup handles HA the same way a Proxmox Ceph HA cluster does (fencing etc.), correct? I might look into it.
What you need to keep in mind with ZFS replication is that you might lose some data, because it is not shared storage but is replicated at a regular interval. For example, if you set the interval to 10 minutes, you could be lucky and the host fails right after a replication, so you lose no or very little data. Or the host fails one minute before the next replication, and you lose 9 minutes of data.

That is why I recommended setting up replication on the DB level between two VMs to keep them in sync and avoid that problem. This also applies if you run DB servers on a Ceph cluster: depending on how much the DB caches, you might lose some data if the host fails because the DB hadn't written it to disk (Ceph) yet.

edit3: can you maybe tell me more about the 2 different pools? "Later on we do want to populate the empty 6x SATA 2,5" bays with cheaper SATA SSDs (and maintain 2 different pools, ceph_nvme and ceph_ssd) for future vms that do not require fast NVME performance."
Creating CRUSH rules which select by device class helps with this. You create a rule for each device class and assign it to the corresponding pool. The Ceph docs cover this nicely [1].
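A minimal sketch of the commands involved (rule and pool names are examples):

Code:
# one replicated rule per device class
ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_ssd  default host ssd

# bind each pool to its rule
ceph osd pool set ceph_nvme crush_rule replicated_nvme
ceph osd pool set ceph_ssd  crush_rule replicated_ssd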



[0] https://docs.ceph.com/en/latest/architecture/#smart-daemons-enable-hyperscale
[1] https://docs.ceph.com/en/latest/rados/operations/crush-map/#device-classes

Edit: removed vague phrasing reg. ZFS replication
 
