Help planning/creating 2 cluster of 3 node each (Compute & Ceph)

croquemox

New Member
Sep 6, 2024
I need an infrastructure to create and provide VMs / containers and virtualized infrastructures (Active Directory etc.).
I have little knowledge of IT infrastructure and spent days researching, which led me to create this diagram.
Am I on the right path with the general idea? Is it feasible, or do you see a glaring mistake?
Thanks!
Untitled-2024-09-06-1153.png
 
I can't really help you much with the Ceph or the fibre-channel part, but for the ethernet/internet/server side, a few things I notice are the following:
  • You have Router and pfSense as two separate blocks, but pfSense itself is a router, so you might want to combine them into one. Or should the router be the pfSense, and where it now says pfSense should be a modem / media converter? Also, maybe take a look at OPNsense instead of pfSense. They have similar roots but are set up a bit differently. I personally prefer OPNsense's setup, but take a look at both and see which you like more.
  • One other point, partly related to this, is that you're showing just 1 internet connection, unless your ISP is providing you with 2 "paths" for redundancy? If you really want everything redundant, internet is something to look at as well. For that you also need to decide whether you just want outgoing traffic to be safe (then a simple failover line from a different provider, or perhaps even wireless such as a 4G modem or Starlink, can be an option), or whether your incoming traffic also needs to be guaranteed up (then you'll need to talk to your ISP about the options for BGP routing or similar setups).
  • Finally, those "black" lines between the nodes, are those also network cables? Are you perhaps planning those for the corosync / cluster network traffic? If not, you might want to look into that. For a stable cluster it is strongly advised to have 1 dedicated port per server for JUST the corosync traffic. It doesn't have to be anything fancy; a 1 Gbit/s Ethernet port is fine (the main thing to look for is latency: low, stable latency is a must for a stable cluster). Just make sure that no other traffic is running across it, and add other ports as fallback for the corosync traffic.
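To make the "low, stable latency" point a bit more concrete, here is a minimal Python sketch that summarizes round-trip times collected on a candidate corosync link (for example, copied from ping output). The function names and both thresholds are assumptions chosen for illustration, not official Proxmox or corosync limits, so treat the result as a rough sanity check only.

```python
from statistics import mean, pstdev


def summarize_rtt(samples_ms):
    """Return (average, jitter, worst) for a list of round-trip times in ms."""
    if not samples_ms:
        raise ValueError("need at least one RTT sample")
    # Population standard deviation is used here as a simple jitter measure.
    return mean(samples_ms), pstdev(samples_ms), max(samples_ms)


def looks_stable(samples_ms, max_avg_ms=2.0, max_jitter_ms=0.5):
    """Rough check: both the average and the jitter must stay low.

    The default thresholds are illustrative assumptions, not vendor limits.
    """
    avg, jitter, _worst = summarize_rtt(samples_ms)
    return avg <= max_avg_ms and jitter <= max_jitter_ms


if __name__ == "__main__":
    # Hypothetical samples: a dedicated link vs. one shared with other traffic.
    dedicated_link = [0.21, 0.19, 0.22, 0.20, 0.23]
    shared_link = [0.25, 0.30, 4.80, 0.28, 9.10]
    print("dedicated link stable:", looks_stable(dedicated_link))
    print("shared link stable:", looks_stable(shared_link))
```

The idea is only that a link with occasional multi-millisecond spikes fails the check even if most samples are fast, which is the situation a dedicated corosync port is meant to avoid.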
 
I'm taking notes on OPNsense and will look into it. You're right, I'll be better off merging those two and handling routing through pfSense / OPNsense.
Unfortunately I don't have a say in the internet failover/redundancy policy or budget, as much as I'd love to :) But thanks for the resources, I'll check them out!
The black wires come straight from the meshed network cluster architecture from Proxmox, with dedicated ports per server. I did not know about the specifics of this corosync connection and will absolutely use it because it sounds really good.

Would you know by any chance what tools are available to work out storage controller requirements? At first I thought storage was just big hard drives, but it turns out you need something to drive them and supervise their integrity / health / redundancy / RAID logic, kind of like a storage supervisor?
Thanks a lot for the feedback!
 
Like I said, I haven't played around with Ceph enough to give any advice that I'm confident in. There are quite a few people who are much more familiar with the hardware/storage space than I am, for example bbgeek/blockbridge.

Also, since it is just a default and not the full plan, I personally prefer to have it all just run to a switch instead of making it a mesh network. Again, this switch doesn't have to be top-of-the-line, but it does need to be "good" (reputable company, preferably with the option for 2 power supplies; VLAN support is nice for future-proofing).

And I totally get that you wouldn't have a say about everything, but there is probably always room in your report/plan for "nice to haves" and/or "things to consider". Even if you can't control it directly, throwing the idea out there can get the people who do have that power to think about it and/or ask you to investigate the options for that too. You wouldn't have to think it all out fully; a general note like "at the moment most parts of our organisation have at least 1 fall-back option, the only exception being the internet access" would do.

Speaking of doubling things up, by the way: one thing you COULD still set up twice is the router itself. With both OPNsense and pfSense you can set up HA between two instances, so that if one goes down (for maintenance, for example, but also because of failure) the other will take over. You can have 1 router per switch, or even have it cross-wired there as well. It all depends on how redundant you want to be, but if you're working with VLANs that run through the router, it is something to think about at least.
 
Got it, I thought you were referring to Ceph specifically. I'll pitch the idea of "things to consider" for failover solutions :)
I re-made the diagram with firewall HA in mind, but I did not get your point about VLANs and "1 router per switch" (were you talking about 1 router per pfSense/OPNsense layer above?). I'm planning to distribute VLANs dynamically in Proxmox for the end users and their VMs, and I might have to create a management layer to handle both pfSense and Proxmox remotely (if I understand correctly). I don't think I understand how redundancy and VLANs relate to one another in your hint?
Thanks again :)
 

Attachments

  • Untitled-2024-09-06-1153(1).png
Ignore the 1 router per switch part, just rambling there.

I personally prefer to put all "hardware" (so the switches, Proxmox, the router's management and the like) on the untagged VLAN and everything else on tagged VLANs, but having everything tagged is an option too (just watch out that you don't create a QinQ, i.e. a VLAN within a VLAN, when you don't intend to).
If you have things in one VLAN that need to reach another VLAN, though, you'll need some kind of routing in between the two. For example, if you're running a monitoring server, it will need to reach the hardware on the untagged VLAN but also VMs in different tagged VLANs, and for that a router needs to be active. With the setup you have now, if one of the routers goes down, inter-VLAN traffic will keep working, other than a blip while it's switching over.
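As a rough illustration of why the router matters here, the following Python sketch checks whether two addresses sit in the same VLAN or whether the traffic would have to be routed. The VLAN names and subnets are made up for the example (they are not from the diagram), and it assumes one IP subnet per VLAN, which is common but not required.

```python
import ipaddress

# Hypothetical VLAN plan: names and subnets are invented for illustration.
VLANS = {
    "untagged-mgmt": ipaddress.ip_network("10.0.0.0/24"),
    "vlan20-vms": ipaddress.ip_network("10.0.20.0/24"),
    "vlan30-vms": ipaddress.ip_network("10.0.30.0/24"),
}


def vlan_of(address):
    """Return the name of the VLAN whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in VLANS.items():
        if ip in network:
            return name
    return None


def needs_router(src, dst):
    """True when src and dst are in different VLANs and must be routed."""
    src_vlan, dst_vlan = vlan_of(src), vlan_of(dst)
    if src_vlan is None or dst_vlan is None:
        raise ValueError("address outside the known VLAN plan")
    return src_vlan != dst_vlan


if __name__ == "__main__":
    # Monitoring server on the untagged VLAN reaching a VM on VLAN 20.
    print(needs_router("10.0.0.10", "10.0.20.5"))
    # Two VMs on the same VLAN talk directly through the switch.
    print(needs_router("10.0.20.5", "10.0.20.9"))
```

Every pair for which `needs_router` returns True depends on the router being up, which is where an HA pair of routers pays off.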

Also, if you're planning to use many VLANs, take a look at the SDN feature within Proxmox. Even if you don't use its IP-defining feature, just being able to easily select a port with the correct VLAN setup, along with a description of who/what it is for, can be very useful.

Drawing looks good to me at the "blue" side btw.
 
Thanks for everything! VLAN-wise I'm still not solid on everything you've brought up, but I will try to be soon. I'll check out SDN too.
Now I just have to work out the storage part of things and I'll have enough elements to start looking for hardware. Thanks again!
 
