They asked me for a CEPH deployment plan!

tcabernoch · Jun 16, 2024

We are getting serious now. The boss wants "Proxmox VSAN". I sold it. Now I gotta deliver.

Gonna take two of the big all-SSD R840 hosts out of production and use em for a new CEPH cluster.
It's not happening today, by any means. I'm gonna build it out in the lab. Demo it. _Test_ it.
When I've got a working model, we can consider moving customers around to clear the two hosts.

There's so much to plan for, but a lot of that comes later.
Other than the massive commitment of existing hardware to a pilot program, I'm considering any expenses.
I sure would like to license the cluster right away, there's a couple bucks.

And I'm considering if I need some faster storage for an NVME tier of CEPH disks installed to these servers.
My testing with a CEPH NVME tier in my homelab gave mixed results, but there were clearly advantages in most configurations I tried.
I sorta need to make that high-speed storage decision soon. Req, buy, and install things takes time.
(Feel free to yell at me right now and tell me to go buy it. That might tip the scales. But this isn't fantasy-infra where I get to order all-NVME arrays.)

gurubert · Jun 17, 2024

tcabernoch said:
take two of the big all-SSD R840 hosts

You need at least three nodes, better go with five for realistic performance.

tcabernoch · Jun 17, 2024

Thanks for the feedback. I understand your point.

I get 2 nodes. Might run an observer on another cluster. The initial guest list will be dev/qa stuff. It does need to be tolerable, but it doesn't need to be smokin fast. We will get more redundancy later if we can stand this up and use it to take load off other future cluster members.

These are very performant machines. Massive, beastly things with huge procs, tons of ram, and large ssd arrays. Maybe CEPH will run lousy regardless, but I have to at least try (and fail) with these two nodes before I ask for more. I've already asked for quite a bit. I think they'd like to see me deliver before I ask for more.

leesteken · Jun 17, 2024

tcabernoch said:
I get 2 nodes.

Well, Ceph is out of the question then.

tcabernoch said:
These are very performant machines. Massive, beastly things with huge procs, tons of ram, and large ssd arrays.

None of that matters when there is a hardware issue with one of them and you don't have other nodes for redundancy.

tcabernoch said:
We are getting serious now. The boss wants "Proxmox VSAN". I sold it. Now I gotta deliver.

It does not sound like you are, sorry.

tcabernoch said:
There's so much to plan for, but a lot of that comes later.

No, planning comes first and you are already committed to two expensive nodes before you even investigated whether it would meet the minimal requirements.

dtom · Jun 17, 2024

3 nodes is really a minimum. We run 4, using all SSD for VMs and SAS spinners with DB/WAL disks for an EC pool for cold storage. Ceph is great and performs well for our needs, but you really need a 3 node cluster.

tcabernoch · Jun 17, 2024

leesteken said:
It does not sound like you are, sorry.

:] I can deliver.

leesteken said:
you are already committed to two expensive nodes before you even investigated

Such testy old linux guy stuff. Classic.

Ok guys. Got your point. I see the immediate challenges in front of me.

Edit ... Updated my plan. There's a couple newer all SSD hosts already running PVE that I can incorporate. I hadn't planned on doing CEPH with them, but they could fill this role. Thanks for the feedback.

alexskysilk · Jun 17, 2024

step 1. remove all non boot drives from your R840s. retain those for compute. examine your existing network topology as you will likely want/need to upgrade it.
step 2. buy 3 smaller and cheaper nodes. populate with at least 4 nics each. the fatter the better.
step 3. repopulate new nodes with your existing drives.
step 4. buy more drives because boss was assuming he'd get MUCH more usable capacity from the existing ones.
step 5. buy new switches when you realize your interfaces were clobbering each other the first time you have any drive channel issues, and your vms were all freezing. reconfigure all your networks.

tcabernoch said:
Such testy old linux guy stuff. Classic.

There is no substitute for experience. I would advise you to use others' pain instead of your own.

dtom · Jun 17, 2024

yeah networking is a big thing to overlook when first implementing. Ceph really needs its own dedicated 10G network at minimum. Also depending on your needs for storage, keep in mind that replicated pools on a 3 node cluster result in roughly 33% storage efficiency. Is your goal uptime, raw capacity, speed, etc.? It took us about 6 months to really dial in what we needed, test, and implement...and that's on a small cluster!
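
To put numbers on that 33% figure, here is a rough sketch. It assumes a replicated pool with 3 copies and, for comparison, an erasure-coded pool with k=2 data and m=1 coding chunks. The 20 TB per node is a made-up figure, and the free headroom a real cluster needs for rebalancing is ignored.

```python
def replicated_usable_tb(raw_tb: float, size: int = 3) -> float:
    """Usable capacity of a replicated pool: raw capacity divided by replica count."""
    return raw_tb / size


def erasure_coded_usable_tb(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity of an erasure-coded pool with k data and m coding chunks."""
    return raw_tb * k / (k + m)


# Hypothetical example: 3 nodes with 20 TB of raw SSD each.
raw = 3 * 20.0
print(f"raw capacity:      {raw:.1f} TB")
print(f"replicated size=3: {replicated_usable_tb(raw):.1f} TB usable")
print(f"EC k=2, m=1:       {erasure_coded_usable_tb(raw, 2, 1):.1f} TB usable")
```

Swap in your own raw capacity and replica count to see how far the boss's expectation is from what the pool will actually hold.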

bbgeek17 · Jun 17, 2024

Proxmox has simplified the use of Ceph by integrating it into the PVE control panel and making it a core component of the PVE infrastructure.

If you were considering running Ceph outside of PVE, be aware that while it is feasible, the learning curve is significantly steeper. If you're working under tight deadlines, you might want to collaborate with a partner who has expertise in Ceph.

Good luck!

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox

gurubert · Jun 17, 2024

If you do not want more than two to three storage nodes, have a look at Linbit's offering regarding DRBD for Proxmox.

jdancer · Jun 17, 2024

For proof-of-concept, 3 nodes will suffice.

For production, you really, really want a minimum of 5 nodes. That way, 2 nodes can fail and you still have 3 nodes for quorum.
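
The quorum arithmetic behind those node counts, as a minimal sketch. It assumes one vote per node and a simple majority rule. The function names are only for illustration.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of the voters."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - quorum_size(voters)


for nodes in (2, 3, 4, 5):
    print(
        f"{nodes} nodes: quorum needs {quorum_size(nodes)}, "
        f"survives {tolerated_failures(nodes)} failure(s)"
    )
```

Under these assumptions a 2-node cluster cannot lose either node, and a 4th node buys no extra failure tolerance over 3.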

I converted a fleet of 13th-gen Dells which used to run VMware vSphere over to Proxmox Ceph.

Made sure all the nodes had the same storage, RAM, CPU, networking, and latest firmware. Also used a true IT-mode storage controller, Dell HBA330.

Zero issues besides the typical SAS HDD dying and needing replacing. This previous VMware cluster never had flash storage to begin with. I just made sure the SAS HDDs had write cache enabled. Not hurting for IOPS. More spindles = more IOPS.

tcabernoch · Jun 17, 2024

That's all a lot of excellent feedback, some of it quite detailed. Thank you all.

I'm running a couple VMware VSAN clusters in one datacenter, so I've got some experience with what it really takes to run clustered storage.
To be honest, our smallest VSAN cluster is 4 node, and the smallest CEPH cluster I tried in the lab was 3 node. Assuming I could do 2 nodes was a stretch.

Network considerations ... I would much rather do this with 100gb. What I have is 10gb fiber w/LACP ... so not-quite-20gb? We already have vlans broken out for vmotion and vsan. I thought I had enough vlans till I met corosync, maybe that needs its own. It sounds like I should work with the netadmin to monitor possible saturation. This scares me.
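
On the not-quite-20gb question, a toy sketch. It assumes the bond picks a member link per flow by hashing the addresses and ports, which is how LACP bonds typically balance traffic. The CRC32 hash, the addresses, and the ports are stand-ins, not what any real switch or kernel uses.

```python
import zlib

LINKS = 2  # two 10gb members in the bond


def pick_link(src: str, dst: str, sport: int, dport: int, links: int = LINKS) -> int:
    """Toy per-flow hash: the same flow always lands on the same member link."""
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return zlib.crc32(key) % links


# One flow (hypothetical addresses and ports) hashed repeatedly.
flow = ("10.0.0.1", "10.0.0.2", 50000, 6800)
used = {pick_link(*flow) for _ in range(1000)}
print(f"a single flow uses {len(used)} link(s)")

# Many flows with different source ports can spread over both members.
spread = {pick_link("10.0.0.1", "10.0.0.2", port, 6800) for port in range(50000, 50100)}
print(f"100 flows use {len(spread)} link(s)")
```

If the bond behaves this way, aggregate throughput across many flows can approach 20gb, but any single stream tops out at one member's 10gb.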

Storage shrinkage ... Understood. Hadn't really considered it, because both machines have a large amount of space, but that space is going to get sliced way down, depending on redundancy and node count. The additional two I decided to add have less storage. They are going to be less useful, might have to rip out the existing drives. Ok, I need to get thinking about that and consider purchase requests.

Proxmox CEPH - Yes, I would absolutely use the native implementation. I want to eventually do CEPH replication between datacenters. We don't have a replacement for realtime DR replication with vmware/zerto.

"write cache enabled." - No, really? In order to build ZFS everywhere and ditch the RAID controller, we are doing single-disk-raid-0 with noreadahead-nocache, so the end result is bare, unbuffered drives. I'd have to trade notes with you for a while to understand your setup.

alexskysilk said:
There is no substitute for experience

I agree, and I think we should highlight it when experience is shared in a positive fashion. Look at your own post. Nice, helpful, lotsa details. You are a model forum user. Thank you for your insight.

jdancer · Jun 18, 2024

tcabernoch said:
...

"write cache enabled." - No, really? In order to build ZFS everywhere and ditch the RAID controller, we are doing single-disk-raid-0 with noreadahead-nocache, so the end result is bare, unbuffered drives. I'd have to trade notes with you for a while to understand your setup.

....

Yeah, really. By default, SAS HDDs ship with write cache disabled. The reason being that these drives were meant to be used on a HW RAID controller with BBU. The HW RAID will do the caching on behalf of the drives.

So, when I converted the Dells over to Proxmox Ceph and replaced the HW RAID with HBA330, I was wondering why IOPS was terrible. Did some troubleshooting and figured out it was because cache was turned off on these SAS HDDs. I've had bad performance with SATA drives with cache off before, so that gave me the idea it might have to do with the cache not being enabled. Once I enabled the write cache on the SAS HDDs, the IOPS were much, much better.

I use the following optimizations learned through trial-and-error. YMMV.

Set SAS HDD Write Cache Enable (WCE) (sdparm -s WCE=1 -S /dev/sd[x])
Set VM Disk Cache to None if clustered, Writeback if standalone
Set VM Disk controller to VirtIO-Single SCSI controller and enable IO Thread & Discard option
Set VM CPU Type to 'Host'
Set VM CPU NUMA on servers with 2 or more physical CPU sockets
Set VM Networking VirtIO Multiqueue to number of Cores/vCPUs
Install Qemu-Guest-Agent software in the VM
Set VM IO Scheduler to none/noop on Linux
Set Ceph RBD pool to use 'krbd' option
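
For the first item in the list, a small sketch that only builds the sdparm command lines for a set of drives, so they can be reviewed before anything gets run. The device names are placeholders. The `--get=WCE` check is an addition for illustration, and the set command is the one quoted above.

```python
import shlex


def wce_commands(devices):
    """Return (check, enable) command strings per device. Nothing is executed."""
    commands = []
    for dev in devices:
        check = shlex.join(["sdparm", "--get=WCE", dev])           # read the current setting
        enable = shlex.join(["sdparm", "-s", "WCE=1", "-S", dev])  # set WCE and save it
        commands.append((check, enable))
    return commands


# Placeholder device names; substitute the SAS HDDs behind the HBA.
for check, enable in wce_commands(["/dev/sda", "/dev/sdb"]):
    print(check)
    print(enable)
```

Printing instead of executing is deliberate: check the current value on one drive by hand first, and keep in mind that a volatile write cache trades safety on power loss for speed.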

alexskysilk · Jun 18, 2024

tcabernoch said:
Network considerations ... I would much rather do this with 100gb.

100gb is great, but bandwidth is one consideration; contention is the real enemy, especially when there is a ceph rebalance storm. a 4x25 setup will be more resilient and more dependable than 1x100. the general gist of what you want here (edit- AT MINIMUM; other networks are probably desirable as well, eg management) is as follows (vlans)
v0- vm guest network
v1- corosync ring 1
v2- corosync ring 2
v3- ceph public
v4- ceph private

you CAN combine any of the above to commingle interfaces/vlans, but the more you insulate those from each other the more resilient your overall system will be.

gurubert · Jun 18, 2024

alexskysilk said:
v1- corosync ring 1
v2- corosync ring 2

If you deploy more than one corosync ring these should really be physically separate networks including the switches. Otherwise it does not make any sense.
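
One way to sanity-check a layout against that advice, as a minimal sketch. It assumes you can write down which NIC (or bond) and which switch each logical network rides on. The bond and switch names here are invented.

```python
from itertools import combinations

# Hypothetical mapping: logical network -> (NIC or bond, switch).
layout = {
    "corosync ring 1": ("bond0", "switch-a"),
    "corosync ring 2": ("bond0", "switch-a"),
    "ceph public": ("bond1", "switch-b"),
    "ceph private": ("bond1", "switch-b"),
}


def shared_paths(layout, names):
    """Return pairs of networks in `names` that share a NIC or a switch."""
    clashes = []
    for a, b in combinations(names, 2):
        nic_a, switch_a = layout[a]
        nic_b, switch_b = layout[b]
        if nic_a == nic_b or switch_a == switch_b:
            clashes.append((a, b))
    return clashes


rings = ["corosync ring 1", "corosync ring 2"]
for a, b in shared_paths(layout, rings):
    print(f"{a} and {b} share hardware: one failure can take out both")
```

An empty result for the two rings means they have no NIC or switch in common in the layout you described. It says nothing about cabling, power, or anything else not in the mapping.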

tcabernoch · Jun 20, 2024

Wow. This is stellar feedback, people. I see some of the headaches and engineering in my future.

I'm particularly concerned about the disk tuning questions. I've been delving deep into storage stuff lately. Looks like it's gonna get deeper. I'll carefully review this. Might reach out to jdancer again later.

And the network. Boy I wish I had 100gb.
Regarding redundant corosync, I've seen vmotion take out a cluster. It was quite a mess. Now, I have a discrete vmotion vlan. I can have my netadmin throttle it if needed.
I'm running 2 logical and one physical corosync 'networks'. I think it's a mistake to say that redundant vlans on the same physical network interface are not a useful configuration. They are simply not more _physically_ redundant. If you allow for the quirks of the local network, it may make every bit of sense. In my own case, we have an 'old' 10gb vlan and a 'new' one. Eventually the old one goes away. The new one will already be in place, and obviously it _shouldn't_ come into play, but if, say, a server IP gets stepped on, it will fall back to its redundant IP.

Thanks again.
