They asked me for a CEPH deployment plan!

tcabernoch

We are getting serious now. The boss wants "Proxmox VSAN". I sold it. Now I gotta deliver.

Gonna take two of the big all-SSD R840 hosts out of production and use em for a new CEPH cluster.
It's not happening today, by any means. I'm gonna build it out in the lab. Demo it. _Test_ it.
When I've got a working model, we can consider moving customers around to clear the two hosts.

There's so much to plan for, but a lot of that comes later.
Beyond the massive commitment of existing hardware to a pilot program, I'm weighing any additional expenses.
I sure would like to license the cluster right away; there's a couple bucks right there.

And I'm considering whether I need some faster storage for an NVMe tier of CEPH disks in these servers.
My testing with a CEPH NVMe tier in my homelab gave mixed results, but there were clear advantages in most configurations I tried.
I sorta need to make that high-speed storage decision soon. Requisitioning, buying, and installing things takes time.
(Feel free to yell at me right now and tell me to go buy it. That might tip the scales. But this isn't fantasy-infra where I get to order all-NVME arrays.)
 
Thanks for the feedback. I understand your point.

I get 2 nodes. Might run an observer on another cluster. The initial guest list will be dev/qa stuff. It does need to be tolerable, but it doesn't need to be smokin fast. We will get more redundancy later if we can stand this up and use it to take load off other future cluster members.

These are very performant machines. Massive, beastly things with huge procs, tons of ram, and large ssd arrays. Maybe CEPH will run lousy regardless, but I have to at least try (and fail) with these two nodes before I ask for more. I've already asked for quite a bit. I think they'd like to see me deliver before I ask for more.
 
I get 2 nodes.
Well, Ceph is out of the question then.
These are very performant machines. Massive, beastly things with huge procs, tons of ram, and large ssd arrays.
None of that matters when there is a hardware issue with one of them and you don't have other nodes for redundancy.
We are getting serious now. The boss wants "Proxmox VSAN". I sold it. Now I gotta deliver.
It does not sound like you are, sorry.
There's so much to plan for, but a lot of that comes later.
No, planning comes first and you are already committed to two expensive nodes before you even investigated whether it would meet the minimal requirements.
 
3 nodes is really the minimum. We run 4, using all-SSD for VMs and SAS spinners with DB/WAL disks for an EC pool for cold storage. Ceph is great and performs well for our needs, but you really need a 3-node cluster.
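Creating an erasure-coded cold-storage pool like that looks roughly like this (a minimal sketch; the profile values and pool name are illustrative, not our actual layout):

# Define an EC profile: k=2 data chunks + m=1 coding chunk, spread across hosts
ceph osd erasure-code-profile set cold-ec k=2 m=1 crush-failure-domain=host
# Create the cold-storage pool using that profile
ceph osd pool create cold-storage erasure cold-ec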
 
It does not sound like you are, sorry.
:] I can deliver.
you are already committed to two expensive nodes before you even investigated
Such testy old linux guy stuff. Classic.

Ok guys. Got your point. I see the immediate challenges in front of me.

Edit ... Updated my plan. There's a couple newer all SSD hosts already running PVE that I can incorporate. I hadn't planned on doing CEPH with them, but they could fill this role. Thanks for the feedback.
 
step 1. remove all non-boot drives from your R840s. retain those for compute. examine your existing network topology, as you will likely want/need to upgrade it.
step 2. buy 3 smaller and cheaper nodes. populate with at least 4 NICs each. the fatter the better.
step 3. repopulate the new nodes with your existing drives.
step 4. buy more drives, because the boss was assuming he'd get MUCH more usable capacity from the existing ones.
step 5. buy new switches when you realize your interfaces were clobbering each other the first time you had any drive channel issues and your VMs were all freezing. reconfigure all your networks.
Such testy old linux guy stuff. Classic.
There is no substitute for experience. I would advise you to use others' pain instead of your own.
 
yeah, networking is a big overlook when first implementing. Ceph really needs its own dedicated 10G network at minimum. Also, depending on your needs for storage, keep in mind that replicated pools on a 3-node cluster result in roughly 33% storage efficiency. Is your goal uptime, raw capacity, speed, etc.? It took us about 6 months to really dial in what we needed, test, and implement... and that's on a small cluster!
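To put numbers on that efficiency figure (illustrative numbers only, assuming the default 3-way replication, where each node holds a full copy):

# usable capacity = raw capacity / replica count
# e.g. 3 nodes x 10 TB raw each = 30 TB raw total
echo $(( 30 / 3 ))   # -> 10 TB usable, i.e. ~33% efficiency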
 
Proxmox has simplified the use of Ceph by integrating it into the PVE control panel and making it a core component of the PVE infrastructure.

If you were considering running Ceph outside of PVE, be aware that while it is feasible, the learning curve is significantly steeper. If you're working under tight deadlines, you might want to collaborate with a partner who has expertise in Ceph.

Good luck!


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
For a proof-of-concept, 3 nodes will suffice.

For production, you really, really want a minimum of 5 nodes. That way, two nodes can fail and you still have 3 nodes for quorum.

I converted a fleet of 13th-gen Dells which used to run VMware vSphere over to Proxmox Ceph.

Made sure all the nodes had the same storage, CPU, RAM, networking, and latest firmware. Also used a true IT-mode storage controller, the Dell HBA330.

Zero issues besides the typical SAS HDD dying and needing replacing. This previous VMware cluster never had flash storage to begin with. I just made sure the SAS HDDs had write cache enabled. Not hurting for IOPS. More spindles = more IOPS.
 
That's all a lot of excellent feedback, some of it quite detailed. Thank you all.

I'm running a couple VMware VSAN clusters in one datacenter, so I've got some experience with what it really takes to run clustered storage.
To be honest, our smallest VSAN cluster is 4 node, and the smallest CEPH cluster I tried in the lab was 3 node. Assuming I could do 2 nodes was a stretch.

Network considerations ... I would much rather do this with 100gb. What I have is 10gb fiber w/LACP ... so not-quite-20gb? We already have VLANs broken out for vMotion and VSAN. I thought I had enough VLANs till I met corosync; maybe that needs its own. It sounds like I should work with the netadmin to monitor possible saturation. This scares me.

Storage shrinkage ... Understood. I hadn't really considered it, because both machines have a large amount of space, but that space is going to get sliced way down depending on redundancy and node count. The additional two I decided to add have less storage. They are going to be less useful; I might have to rip out the existing drives. Ok, I need to get thinking about that and consider purchase requests.

Proxmox CEPH - Yes, I would absolutely use the native implementation. I want to eventually do CEPH replication between datacenters. We don't have a replacement for the realtime DR replication we get with VMware/Zerto.
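For what it's worth, the cross-datacenter piece would presumably be rbd-mirror. A minimal sketch, assuming the rbd-mirror daemon is set up at the second site (the pool and image names here are made up):

# Enable per-image mirroring on the pool, then snapshot-based mirroring for one image
rbd mirror pool enable vm-pool image
rbd mirror image enable vm-pool/vm-100-disk-0 snapshot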

"write cache enabled." - No, really? In order to build ZFS everywhere and ditch the RAID controller, we are doing single-disk-raid-0 with noreadahead-nocache, so the end result is bare, unbuffered drives. I'd have to trade notes with you for a while to understand your setup.

There is no substitute for experience
I agree, and I think we should highlight it when experience is shared in a positive fashion. Look at your own post. Nice, helpful, lotsa details. You are a model forum user. Thank you for your insight.
 
...

"write cache enabled." - No, really? In order to build ZFS everywhere and ditch the RAID controller, we are doing single-disk-raid-0 with noreadahead-nocache, so the end result is bare, unbuffered drives. I'd have to trade notes with you for a while to understand your setup.

...

Yeah, really. By default, SAS HDDs ship with write cache disabled. The reason being that these drives were meant to be used on a HW RAID controller with BBU. The HW RAID will do the caching on behalf of the drives.

So, when I converted the Dells over to Proxmox Ceph and replaced the HW RAID with the HBA330, I was wondering why IOPS were terrible. Did some troubleshooting and figured out it was because cache was turned off on these SAS HDDs. I've had bad performance with SATA drives with cache off before, so that gave me the idea it might have to do with the cache not being enabled. Once I enabled the write cache on the SAS HDDs, the IOPS were much, much better.
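Checking and flipping that bit looks roughly like this (a sketch; the device name is a placeholder):

# Check whether the write cache (WCE bit) is currently enabled
sdparm --get=WCE /dev/sdb
# Enable it; -S (--save) persists the setting across power cycles
sdparm --set=WCE -S /dev/sdb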

I use the following optimizations learned through trial-and-error. YMMV.

Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
Set VM Disk Cache to None if clustered, Writeback if standalone
Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
Set VM CPU Type to 'Host'
Set VM CPU NUMA on servers with 2 or more physical CPU sockets
Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
Set VM Qemu-Guest-Agent software installed
Set VM IO Scheduler to none/noop on Linux
Set Ceph RBD pool to use 'krbd' option
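Applying a few of the VM-level settings via Proxmox's qm CLI would look something like this (a hedged sketch; VM ID 100 and the storage name 'ceph-vm' are placeholders, not my actual setup):

# CPU type host, NUMA on, VirtIO SCSI single controller
qm set 100 --cpu host --numa 1 --scsihw virtio-scsi-single
# Disk with IO thread and discard enabled, cache=none for clustered storage
qm set 100 --scsi0 ceph-vm:vm-100-disk-0,iothread=1,discard=on,cache=none
# Enable the QEMU guest agent (the agent software must be installed in the guest)
qm set 100 --agent enabled=1
# Switch the RBD storage to the kernel client ('krbd' option)
pvesm set ceph-vm --krbd 1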
 
Network considerations ... I would much rather do this with 100gb.
100gb is great, but bandwidth is only one consideration; contention is the real enemy, especially when there is a ceph rebalance storm. A 4x25 setup will be more resilient and more dependable than 1x100. the general gist of what you want here (edit- AT MINIMUM; other networks are probably desirable as well, e.g. management) is as follows (vlans):
v0- vm guest network
v1- corosync ring 1
v2- corosync ring 2
v3- ceph public
v4- ceph private

you CAN combine any of the above, commingling interfaces/vlans, but the more you insulate them from each other, the more resilient your overall system will be.
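As a rough sketch of how the ceph public/private vlans might ride a bond in /etc/network/interfaces (interface names, VLAN IDs, and addresses are all examples, not a recommendation):

auto bond0
iface bond0 inet manual
    bond-slaves ens1f0 ens1f1
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

auto bond0.40
iface bond0.40 inet static
    # v3 - ceph public
    address 10.40.0.11/24

auto bond0.50
iface bond0.50 inet static
    # v4 - ceph private
    address 10.50.0.11/24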
 
Wow. This is stellar feedback, people. I see some of the headaches and engineering in my future.

I'm particularly concerned about the disk tuning questions. I've been delving deep into storage stuff lately. Looks like it's gonna get deeper. I'll carefully review this. Might reach out to jdancer again later.

And the network. Boy I wish I had 100gb.
Regarding redundant corosync, I've seen vmotion take out a cluster. It was quite a mess. Now, I have a discrete vmotion vlan. I can have my netadmin throttle it if needed.
I'm running 2 logical and one physical corosync 'networks'. I think it's a mistake to say that redundant VLANs on the same physical network interface are not a useful configuration. It is simply not more _physically_ redundant. If you allow for the quirks of the local network, it may make every bit of sense. In my own case, we have an 'old' 10gb VLAN and a 'new' one. Eventually the old one goes away. The new one will already be in place, and obviously it _shouldn't_ come into play, but if, say, a server IP gets stepped on, it will fall back to its redundant IP.
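For the curious, that two-ring setup amounts to something like this in /etc/pve/corosync.conf (a sketch; the node name and addresses are placeholders for my actual ones):

nodelist {
  node {
    name pve1
    nodeid 1
    # ring0 on the 'old' 10gb vlan, ring1 on the 'new' one
    ring0_addr 10.10.0.11
    ring1_addr 10.20.0.11
  }
}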

Thanks again.
 
