Proxmox VE CEPH configuration help needed!

Jan 11, 2024
I will preface this by saying I am a total ProxMox/Ceph noob... I am good with Linux and I have been a VMware admin/engineer for the better part of two decades...

I have looked through dozens of walk-throughs and videos, but I am still struggling with CEPH and I need some help.
Goal:
HA VM/Storage Cluster with Tiered Storage Capability
NVMe
SSD
Spinner

Current Environment:
4x Compute Nodes Dell PowerEdge R650 w/
1TB RAM
2x 64GB SATA DOM ProxMox OS
3x 2TB SSD RAIDz ZFS Local Storage

3x SuperMicro Mass storage servers w/
512GB RAM
2x 64GB SATA DOM ProxMox OS
4x 2TB NVMe
12x 2TB SSD
14x 16TB HDD Spinner Drives

40Gbps interface dedicated CEPH/Cluster Network
4x10Gbps Bonded Client Network

Any help or advice would be greatly appreciated!
 
Seems pretty straightforward. Without knowing what your use case is, it looks like 3 pools by device class: an nvme pool and an ssd pool for "faster" and "fast" rbd, and the HDD pool for cephfs.
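To give a rough idea (pool names and PG counts below are placeholders, and this assumes the default CRUSH root with host failure domain), device-class rules and pools look something like:

```
# one replicated CRUSH rule per device class
ceph osd crush rule create-replicated rule-nvme default host nvme
ceph osd crush rule create-replicated rule-ssd  default host ssd
ceph osd crush rule create-replicated rule-hdd  default host hdd

# RBD pools pinned to the flash tiers (placeholder names and PG counts)
pveceph pool create nvme-vm --crush_rule rule-nvme --pg_num 128
pveceph pool create ssd-vm  --crush_rule rule-ssd  --pg_num 256
```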

I also want to note that ceph requires TWO networks, not one; if you're going to use a single interface for both public and private traffic, expect a maximum possible performance of half the link speed. Not likely to impact you much with a 40Gb interface, but worth a mention.

If you can be more specific about what you want to accomplish, we might have better (more tailored) suggestions.
 
Sorry, but I have to catch this straight away so it doesn't become a habit. The company is called Proxmox (no capital M), the product you use is called Proxmox VE.

See: https://www.proxmox.com/en/about/media-kit
I strongly advise you not to use SATA DOMs or SD cards for Proxmox VE. You may be familiar with this from VMware, but with PVE you'll be replacing those parts more often than you can imagine. They are intended for mostly-read workloads, and Debian writes significantly more than VMware does. At least, I've only had bad experiences with them; I prefer two Samsung SM or PM series SSDs with 120 or 240GB instead.
3x 2TB SSD RAIDz ZFS Local Storage
I would simply include it in Ceph. Or do you have special requirements, such as a high-performance database?

40Gbps interface dedicated CEPH/Cluster Network
4x10Gbps Bonded Client Network
If you already have the hardware, use it. If not, and it's within your price range, go straight to the 100G standard. The fat storage nodes in particular will thank you for 50 or 100G; with 40 GbE you don't get much out of it.
 
I strongly advise you not to use SATA DOMs or SD cards for Proxmox VE
Not wrong, but in case you have no other option, this can be mitigated somewhat by moving /var off the boot device.
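A rough sketch of what that looks like, assuming a spare SSD partition (/dev/sdX1 is a placeholder) and a maintenance window:

```
# format the spare partition and copy the current /var over
mkfs.ext4 /dev/sdX1
mkdir /mnt/newvar
mount /dev/sdX1 /mnt/newvar
rsync -aX /var/ /mnt/newvar/
# mount it as /var from the next boot onward
echo '/dev/sdX1 /var ext4 defaults 0 2' >> /etc/fstab
reboot
```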

I would simply include it in Ceph.
Seconded. As a general rule, avoid raidz1 like the plague, and avoid parity raid for virtual disk use.

go straight to the 100G standard. The fat storage nodes in particular will thank you for 50 or 100G; with 40 GbE you don't get much out of it.
I beg to differ. With only 4 compute nodes it's unlikely to ever generate sufficient load to stress even 10gb, much less 40. Where 25gbit ethernet truly shines (and by inference 50 and 100gbit) is the substantially better latency; loaded DBs would benefit from 25gbit over 10 or 40gbit. What sucks is just how much more expensive it is per port...
 
Not wrong, but in case you have no other option, this can be mitigated somewhat by moving /var off the boot device.
Of course you can do that, but I wouldn't bolt on a workaround that ultimately brings me no further benefit and makes my system even more complex.
If I put /var on an SSD in the bays, I've won exactly nothing, and I can no longer use that SSD as an OSD. In that case I've just added unnecessary complexity across 4 disks. So from my point of view your solution doesn't make much sense; then just do without 2 OSDs and install Proxmox VE on them instead.
I beg to differ. With only 4 compute nodes it's unlikely to ever generate sufficient load to stress even 10gb, much less 40. Where 25gbit ethernet truly shines (and by inference 50 and 100gbit) is the substantially better latency; loaded DBs would benefit from 25gbit over 10 or 40gbit. What sucks is just how much more expensive it is per port...
But have you seen that there are 4 compute nodes and 3 storage nodes? If he also includes the NVMe of the compute nodes, there are a total of 7 nodes in the Ceph cluster.
10 GbE corresponds to about 1.25 GB/s, which a single NVMe alone can saturate, so we are still throttled. In the new Ceph benchmark from Proxmox [0], this was exactly what could be observed in a 3-node mesh setup.
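As a back-of-the-envelope comparison (line rate only, ignoring protocol overhead; the NVMe figure is an assumption for a current PCIe 4.0 enterprise drive):

```
# line rate in Gbit/s divided by 8 = rough ceiling in MB/s
echo "10 GbE:  $(( 10  * 1000 / 8 )) MB/s"   # ~1250 MB/s - less than one NVMe
echo "40 GbE:  $(( 40  * 1000 / 8 )) MB/s"   # ~5000 MB/s - about one fast NVMe
echo "100 GbE: $(( 100 * 1000 / 8 )) MB/s"   # ~12500 MB/s - headroom for several OSDs
```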

The storage servers are supposed to run with a total of 30 disks each, 16 of which are flash. With the 100G standard, not only will the latency be better, but also the bandwidth and the room to expand.

[0] https://forum.proxmox.com/threads/p...eds-in-a-proxmox-ve-ceph-reef-cluster.137964/
 
You're talking about bandwidth; I'm talking about load. I'm not saying 100gbit is without merit, I'm saying that this setup isn't likely to benefit much vs the cost associated.
In the new Ceph benchmark from Proxmox [0], this was exactly what could be observed in a 3-node mesh setup.
The thing about benchmarks is that they show what the limits are. Saying "my storage can sustain 1m iops" isn't the same as actually generating those IOPS in actual production.
 
You're talking about bandwidth;
Not only, latency too.
I'm not saying 100gbit is without merit,
I never said to use 100G, I just said the 100G standard; so, for example, SFP28 and therefore 25G and not 100G.
The thing about benchmarks is that they show what the limits are.
Right, and when an NVMe can max out my uplink, I've reached the point where I realize that 4x 10 GbE is simply not suitable for the cluster.

But it doesn't matter; ultimately it's just my opinion that you shouldn't run NVMe on 10G, and really not on 25G or 40G either, but go directly to 2x 100G. The nodes here are also quite large, so high bandwidth makes sense. For a node with 6 SATA SSDs, 2x 10 GbE is definitely okay; for 16 SSDs I would use at least 40 GbE. Ultimately, the network is never so expensive that I would let it become a bottleneck and then have to replace everything. The foundation should therefore be chosen so that it can also scale with growth. But as I said: if it's already there, use it, and if you're buying new anyway, then going bigger is a good idea.
 
Guys thank you for your input,

Use case:
We are a Wireless ISP; we host 99% of our applications and web resources in house. The majority of our applications run on RHEL or Debian, with a couple of Windows Servers/SQL Servers for applications that do not support running on Linux, i.e. SolarWinds (the SolarWinds database does have HIGH I/O).
We have been running VMware for years; however, VMware support has fallen on its face recently and we have been waiting 3 months for new licensing that is already paid for... Thank you Broadcom... We are hoping to move over to PROXMOX as an alternative to VMware.

On this note... I have been testing migrating VMs from our VMware environment to PROXMOX, and almost without fail I have been getting initramfs errors. Any advice on this issue would be great as well...

Network interfaces:
All Nodes: 2x10G Bonded LACP Client Net AND 4x10G Bonded LACP CEPH/Cluster Net (the Compute Nodes actually have 25G-capable interfaces; however, we do not currently have the optics to run 25G on them)

**I would love to go to a full 100G net; however, we are primarily a MikroTik shop, and so far MikroTik has only come out with switches and routers that have a max of 2x 100G QSFP interfaces

Local Storage ZFS:
Honestly it's there because the compute nodes have the storage and I have not figured out how to store ISOs and Container Templates in CEPH

Sata Doms:
We have a bunch laying around and it sounded like a good idea at the time... I will move the OS over to a couple of in-bay SSDs

CEPH Storage:
My thought was:
NVME array
SSD array
Fast Spinner (3:1, 16TB Spinner to 4TB SSD). I figured out that I could only go 1:1 with a 16TB Spinner and a 2TB SSD for DB/WAL, so we have added 4x 4TB SSDs to the plan and cut the number of 16TB Spinners back to 12 on each NAS Node
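From what I've read, this is roughly the command that creates a spinner OSD with its DB/WAL carved out of a flash device on PVE (device names and the DB size in GiB are placeholders I picked; each additional OSD created against the same db_dev gets its own DB LV):

```
# 16TB spinner with its RocksDB/WAL on an NVMe/SSD; db_dev_size is in GiB
pveceph osd create /dev/sdc --db_dev /dev/nvme0n1 --db_dev_size 300
```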

I still need some help with developing CRUSH maps for the different storage arrays.
If I make the local storage CEPH rather than ZFS, is there a way to store ISOs/Container Templates in CEPH?
How would I keep the smaller pool on the Compute Nodes from being absorbed into the larger pools on the NAS Nodes?
 
and almost without fail I have been getting initramfs errors. Any advice on this issue would be great as well...
How are you migrating the VMs? You're getting those errors because something is different with the storage (e.g. VG name, host bus, etc.). This can be fixed after the fact with a live CD, but it would be easier to use a different method of migration. See https://pve.proxmox.com/wiki/Migration_of_servers_to_Proxmox_VE for options and discussion.
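The after-the-fact fix is usually something like this, from a live ISO attached to the VM (Debian/Ubuntu guest shown; device, VG and LV names are guesses and will differ):

```
# mount the migrated root filesystem and chroot into it
mount /dev/mapper/vg0-root /mnt            # placeholder VG/LV
mount /dev/sda1 /mnt/boot                  # placeholder boot partition
for d in dev proc sys run; do mount --bind /$d /mnt/$d; done
chroot /mnt
# rebuild the initramfs so it picks up the virtio drivers, then refresh grub
update-initramfs -u -k all                 # RHEL-family guests: dracut -f --regenerate-all
update-grub
exit
reboot
```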

Local Storage ZFS:
Honestly it's there because the compute nodes have the storage and I have not figured out how to store ISOs and Container Templates in CEPH
Do you intend to roll the hardware into your ceph pool after decommissioning? If not, just leave it. ZFS filers are perfect for the use case you describe.

My thought was:
NVME array
SSD array
That's fine in theory, but I'd suggest a bit more planning, namely:
1. How many nodes in your cluster? Will you keep separate compute and storage nodes? If yes, how many of these will be used for ceph/OSDs?
2. How many VMs are you deploying?
3. For each VM, separate the storage use type, e.g. high IOPS, mid IOPS, bulk storage. What is the total necessary capacity in GB for each type?
4. What is your backup/DR strategy?

As with vSphere, you should plan on no more than N-1 load on your compute nodes.
Fast Spinner (3:1, 16TB Spinner to 4TB SSD). I figured out that I could only go 1:1 with a 16TB Spinner and a 2TB SSD for DB/WAL, so we have added 4x 4TB SSDs to the plan and cut the number of 16TB Spinners back to 12 on each NAS Node
I will have more to say after your response, but in GENERAL I'd not bother mixing HDD and SSD pools; your SSDs will be better used in their own pools.
All Nodes: 2x10G Bonded LACP Client Net AND 4x10G Bonded LACP CEPH/Cluster Net
The only change I'd recommend is to keep all interfaces in pairs; that way you can separate ceph public and private traffic more effectively. IDEALLY each pair should be connected to different switches for redundancy.
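As a sketch, the pairing could look like this in /etc/network/interfaces on one node (NIC names and subnets are made up; one 2-link LACP bond for the ceph public network, one for the ceph cluster network):

```
# ceph public network - placeholder NIC names and addressing
auto bond1
iface bond1 inet static
        address 10.10.10.11/24
        bond-slaves ens1f0 ens1f1
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4

# ceph cluster (OSD replication) network
auto bond2
iface bond2 inet static
        address 10.10.20.11/24
        bond-slaves ens1f2 ens1f3
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
```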
I would love to go to a full 100G net however, we are primarily a MikroTik shop
I will only say that I would choose equipment that has a reputation for...umm... uptime.
 
How are you migrating the VMs? You're getting those errors because something is different with the storage (e.g. VG name, host bus, etc.). This can be fixed after the fact with a live CD, but it would be easier to use a different method of migration. See https://pve.proxmox.com/wiki/Migration_of_servers_to_Proxmox_VE for options and discussion.
I was following these articles
https://edywerder.ch/vmware-to-proxmox
https://knowledgebase.45drives.com/...-virtual-machine-disks-from-vmware-to-proxmox
Do you intend to roll the hardware into your ceph pool after decommissioning? If not, just leave it. ZFS filers are perfect for the use case you describe.
Currently one of the Compute Nodes and one of the NAS nodes are still supporting our VMware Environment in a limited capacity while we test and figure out the path forward with PROX/CEPH

We also have 4x servers currently running as TV/IPTV transcoders with significant GPU resources that we are looking to convert into Compute Nodes as well, but that will be a later project.
That's fine in theory, but I'd suggest a bit more planning, namely:
1. How many nodes in your cluster? Will you keep separate compute and storage nodes? If yes, how many of these will be used for ceph/OSDs?
2. How many VMs are you deploying?
3. For each VM, separate the storage use type, e.g. high IOPS, mid IOPS, bulk storage. What is the total necessary capacity in GB for each type?
4. What is your backup/DR strategy?
Current Plan is:
1. 4x Compute Nodes and 3x NAS Nodes
I have not figured out a good way to keep the storage and compute nodes separate and still have access to the storage
I had considered creating a separate storage cluster and running a NAS OS like TrueNAS on those nodes, then sharing storage pools via iSCSI, but I don't know if that makes sense... I am absolutely open to suggestions on this.
2. Currently we have approximately 100VMs and 400 Docker Containers
3. Most of our environment is fairly low IOPS with exception of our SolarWinds and our IPTV applications (IPTV is currently running all on physical hardware but we want to virtualize... future project)
4. Backup/DR strategy: we intend to leverage the PROXMOX Backup Server with Spinners for Cold Storage. We also have an off-site secondary data center connected via a dedicated 20Gbps DIA circuit, with its own 100Gbps internet circuit, and BGP is configured between the sites.
Ultimately we will be replicating the setup we have in our primary data center at our secondary data center for redundancy and failover.

As with vSphere, you should plan on no more than N-1 load on your compute nodes.

I will have more to say after your response, but in GENERAL I'd not bother mixing HDD and SSD pools; your SSDs will be better used in their own pools.
I had looked at pairing the SSDs/Spinners based on several other posts saying that it would help mitigate the inherent latency of spinners... if the juice is not worth the squeeze, I am not completely committed to the idea. Again, open to recommendations and advice.
The only change I'd recommend is to keep all interfaces in pairs; that way you can separate ceph public and private traffic more effectively. IDEALLY each pair should be connected to different switches for redundancy.
The CEPH network is completely isolated on a separate switch from the Client network; there is no public CEPH, it is all private.
I will only say that I would choose equipment that has a reputation for...umm... uptime.
 
1. 4x Compute Nodes and 3x NAS Nodes
I have not figured out a good way to keep the storage and compute nodes separate and still have access to the storage
For your use case, I suggest using 5 OSD nodes. Remember, you can mix compute and ceph on the same hardware, but you should precalculate the resources you'd need to expend on each. In other words: consider a core and 4GB of RAM as "reserved" per OSD, and only commit load that would fit on what remains. In actuality you wouldn't use that much of the resources when all is well; they just need to be available when things go wrong.
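Applied to your storage nodes (assuming 4 NVMe + 12 SSD + 12 HDD = 28 OSDs each, per your revised plan), that rule of thumb works out to roughly:

```
OSDS=28                                    # OSDs per storage node (assumption)
echo "reserved cores: $OSDS"               # ~28 cores set aside per node
echo "reserved RAM:   $(( OSDS * 4 )) GB"  # ~112 GB of the 512 GB per node
```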

2. Currently we have approximately 100VMs and 400 Docker Containers
I suppose I should have asked this differently: how many cores and how much RAM does this load require at minimum, and at maximum?

3. Most of our environment is fairly low IOPS with exception of our SolarWinds and our IPTV applications
The more precise the answer, the more accurate the proposed solution can be. I don't know what this translates to. Your docker containers probably don't hit disk very much, but knowing what your VMs do would give you some idea of the count and type of disk mechanisms to use for OSDs for optimal results.

4. Backup/DR strategy: we intend to leverage the PROXMOX Backup Server with Spinners for Cold Storage. We also have an off-site secondary data center connected via a dedicated 20Gbps DIA circuit, with its own 100Gbps internet circuit, and BGP is configured between the sites.
Ultimately we will be replicating the setup we have in our primary data center at our secondary data center for redundancy and failover.
With this much bandwidth available between your DCs, spreading your cluster across two zones could be an option.
The CEPH network is completely isolated on a separate switch from the Client network; there is no public CEPH, it is all private.
"public" and "private" dont mean that in ceph parlance. "public" refers to the network servicing guest traffic (eg, rbd) while "private" refers to the network servicing OSD traffic. you have to specify both, although it is allowed to use the same network for both. there is no difference in terms of available bandwidth between a single LACP set with 4 links or 2 of 2, but latency is limited to a single link speed in either case. by splitting the networks you'd have better latency for both applications.
 
For your use case, I suggest using 5 OSD nodes. Remember, you can mix compute and ceph on the same hardware, but you should precalculate the resources you'd need to expend on each. In other words: consider a core and 4GB of RAM as "reserved" per OSD, and only commit load that would fit on what remains. In actuality you wouldn't use that much of the resources when all is well; they just need to be available when things go wrong.


I suppose I should have asked this differently: how many cores and how much RAM does this load require at minimum, and at maximum?
On our current VMware server (2x Intel Xeon Gold 5317 12-core CPUs and 1TB RAM), running minimal systems with no redundancy (sigh, yes, I know this is horrible), we are using about 79% CPU and 45% RAM, and using 600GB NVMe, 800GB SSD and 18TB Spinner.

Running at full capacity with redundancy and failover across 3 compute nodes, we were running 80% CPU and 90% RAM usage (before the PROXMOX project started). We had a lot of VM/resource bloat, which I am evaluating and want to mitigate as we move over.
The more precise the answer, the more accurate the proposed solution can be. I don't know what this translates to. Your docker containers probably don't hit disk very much, but knowing what your VMs do would give you some idea of the count and type of disk mechanisms to use for OSDs for optimal results.
End state: I want to get to a point where we are running comfortably and balanced at or below 50% usage.
Even with backups I do not foresee utilizing all the storage we have currently allocated for this project any time in the near future and all 3 of our NAS Nodes are only about half populated with drives so we have room to grow as needed.
With this much bandwidth available between your DCs, spreading your cluster across two zones could be an option.

"public" and "private" dont mean that in ceph parlance. "public" refers to the network servicing guest traffic (eg, rbd) while "private" refers to the network servicing OSD traffic. you have to specify both, although it is allowed to use the same network for both. there is no difference in terms of available bandwidth between a single LACP set with 4 links or 2 of 2, but latency is limited to a single link speed in either case. by splitting the networks you'd have better latency for both applications.
I did not see any options for adding additional CEPH links when I set up CEPH. Currently each node has 3 links: Cat6 mgmt/iDRAC/IPMI, VMnet/Client bond (2x 10Gbps), and CEPH/Cluster bond (4x 10Gbps). The CEPH/Cluster net is currently not routable from the client network.

Network traffic in our VMware environment when it was running at capacity:
Client/VMnet avg 2.3Gbps
Storage Net 15Gbps
Replication/vMotion/Backup net avg 8Gbps
 
Well, the SATA DOMs failed as y'all said they likely would. I replaced them with SSDs; however, I cannot figure out how to recover my OSDs. I can see the drives under Disks, and it shows the LVM Ceph OSD#, but I cannot get them to mount. Any help with this would be awesome.

 
