Hi all.
I've been a Proxmox user since v2.x and am about to roll out a new lab environment using PVE 5.0 and Ceph for some testing. Hardware-wise, I've got 3 nodes spec'd out like this:
Supermicro 2U
dual Xeon E5-2670s
192GB RAM
LSI 9207-8i HBA
Intel X520-DA2
x2 240GB SanDisk SSD PLUS drives
x2 960GB SanDisk Ultra II SSDs
x1 Samsung SM961 256GB NVMe M.2 SSD (in PCIe adapter)
For a 3-node PVE cluster, this hardware will do great. For Ceph, it should do great as well (as monitor nodes). My current plan is to use the two 240GB SSDs in a software RAID 1 for the OS (ZFS mirror, maybe?), the two 960GB SSDs as OSDs, and the 256GB NVMe drive for journals. This SSD-only pool would be used ONLY for LXC and KVM instances. I know consumer/prosumer SSDs aren't ideal, but I don't think my I/O load will be heavy enough to kill these drives in under 3 years, especially with a dedicated journal device.
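For reference, here's the back-of-envelope endurance math I'm working from (just a sketch; the TBW rating, daily write volume, and write amplification factor below are my assumptions, not published specs):

# Rough endurance check for a 960GB SSD OSD (assumed numbers, not specs)
assumed_tbw_tb = 200          # assumed total-bytes-written rating, in TB
daily_writes_gb = 50          # assumed client writes landing on this drive per day
write_amplification = 2       # rough factor for replication/journal overhead

effective_daily_tb = daily_writes_gb * write_amplification / 1000
years = assumed_tbw_tb / effective_daily_tb / 365
print(f"Estimated drive life: {years:.1f} years")   # ~5.5 years with these numbers

With anything close to those numbers the drives should outlast the 3-year window, but obviously the real answer depends on the actual write load.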
So first off, does my plan seem sane? Would 3/2 or 2/1 be recommended from a performance perspective? Space-wise, either should be fine; I just didn't know if a replica count of 3 would be advisable on a 3-node cluster (one that could potentially grow to 4 nodes in the future, but never beyond 4). A 3/2 setup would still yield somewhere around 1.8TB usable, which will likely be plenty of space. However, I have the ability to add 2 more 960GB Ultra IIs to each node (4 total 960GB Ultra II SSD OSDs per node, 12 total in the cluster). Would 4 per node give me a noticeable performance increase, or would 2 SSD OSDs per node be enough for a lab environment? The most I/O-intensive workloads in this cluster would be syslog-ng and a clustered Splunk environment. Going with 12 total 960GB SSD OSDs would be over budget, but doable if needed.
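Here's the rough capacity math behind the ~1.8TB figure, plus the 4-OSDs-per-node option (a sketch; the ~85% practical fill level is just my assumption for rebalancing headroom):

# Usable-capacity sketch for the SSD pool
osd_size_tb = 0.96            # 960GB drives
nodes = 3

for osds_per_node in (2, 4):
    raw_tb = osd_size_tb * osds_per_node * nodes
    for replicas in (3, 2):
        usable_tb = raw_tb / replicas * 0.85   # assume ~85% practical fill level
        print(f"{osds_per_node} OSDs/node, size={replicas}: ~{usable_tb:.2f} TB usable")

# 2 OSDs/node: size=3 -> ~1.63 TB, size=2 -> ~2.45 TB
# 4 OSDs/node: size=3 -> ~3.26 TB, size=2 -> ~4.90 TB

So even with size=3 and only 2 OSDs per node, I'm in the ballpark of what I need.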
Secondly, I'm also considering a second Ceph pool using Seagate 5TB 2.5" drives (5400RPM, slow, consumer-grade), since these servers have 24 2.5" bays each. I'd start with 4 drives per node and expand up to ~16 drives per node (48 5TB drives total). If I were to do this, would I be able to use the SAME 256GB NVMe drive for journals? Meaning, both Ceph pools would share the same journal device. I'd say 20 OSDs per node max, so I figure 10GB of journal per drive. Does that sound doable? I'm trying to plan ahead for the maximum build-out here. This second pool would be used for WORM (Write Once, Read Many) data, such as LXC/KVM backups, user drives, etc. I'd also think 3/2 would be appropriate for this pool.
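And the rough journal-space math for the shared NVMe (again just a sketch using the numbers above):

# Journal-sizing sketch for the shared NVMe
nvme_capacity_gb = 256        # SM961 256GB (nominally ~238 GiB usable)
journal_size_gb = 10
max_osds_per_node = 20

needed_gb = journal_size_gb * max_osds_per_node
print(f"Journal space needed: {needed_gb} GB of {nvme_capacity_gb} GB "
      f"({'fits' if needed_gb <= nvme_capacity_gb else 'does not fit'})")
# -> 200 GB of 256 GB: it fits, but with little headroom left on the device

So it fits on paper, but I realize that would put every OSD journal on a node behind a single consumer NVMe device.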
Network-wise, each node would have two 10Gbit links (SFP+ DAC cables): one dedicated to Ceph, and one dedicated to LXC/KVM traffic. Each node would also have two 1GbE links in a LAG for LAN management (UI access, updates, etc.) as well as Corosync communication.
Anything additional I didn't think to cover, or does this sound like a solid environment (for a lab, considering the non-enterprise storage)?
Thanks!!