I currently have a 4-node HCI cluster that's working quite well. It will be expanded to 8 nodes total and used for critical services. All of the testing was satisfactory and management was duly impressed. I am reinstalling the cluster from scratch to ensure none of the testing bits are hanging around and to produce documentation for our environment. Each node has the following (a rough /etc/network/interfaces sketch follows the list):
2x 120GB SSDs in ZFS RAID 1 for the OS
1x 480GB SSD for OSD (We will be moving to 2x or 3x 1TB SSDs)
2x Intel X710 2-port 10Gb NICs
- Card 1 Port 1 / Card 2 Port 1, 802.3ad, bond0, used for vmbr0, 10.201.0.0/23
- Card 1 Port 2 / Card 2 Port 2, 802.3ad, bond1, 192.168.21.0/24
1x Intel I350 (onboard) 2-port 1Gb NIC
- Both ports, balance-alb, bond2, 192.168.20.0/24
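For reference, the per-node /etc/network/interfaces looks roughly like the sketch below. The NIC names (enp65s0f*, eno*) and the host addresses are placeholders I've substituted for illustration; adjust them to whatever your hardware and IP plan actually use.

auto lo
iface lo inet loopback

# X710 10Gb ports (placeholder names, check with "ip link")
iface enp65s0f0 inet manual
iface enp65s0f1 inet manual
iface enp66s0f0 inet manual
iface enp66s0f1 inet manual

# I350 onboard 1Gb ports (placeholder names)
iface eno1 inet manual
iface eno2 inet manual

# Card 1 Port 1 + Card 2 Port 1, LACP, carries vmbr0
auto bond0
iface bond0 inet manual
        bond-slaves enp65s0f0 enp66s0f0
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

# Card 1 Port 2 + Card 2 Port 2, LACP, 192.168.21.0/24
auto bond1
iface bond1 inet static
        address 192.168.21.11/24
        bond-slaves enp65s0f1 enp66s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3

# Both onboard 1Gb ports, balance-alb, 192.168.20.0/24
auto bond2
iface bond2 inet static
        address 192.168.20.11/24
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode balance-alb

# VM/container bridge on 10.201.0.0/23
auto vmbr0
iface vmbr0 inet static
        address 10.201.0.11/23
        gateway 10.201.0.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0

The 802.3ad bonds of course need matching LACP port-channels on the switch side.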
I wish I could get separate drives for WAL and DB, but that's not in the cards; I've already pushed and been told "no," so I have to live with what I have there.
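With everything on the one SSD, OSD creation per node is just the plain form of pveceph with no --db_dev/--wal_dev; /dev/sdc below is a placeholder device name:

# Confirm which device the 480GB SSD actually is first
lsblk
# Create the OSD with DB and WAL colocated on the same device
pveceph osd create /dev/sdc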
I am trying to piece together the best networking setup. We need VMs/Containers to migrate if they can't be reached on the 10.201.0.0 network. I had split Corosync out to the 192.168.20.0 network, but that doesn't trigger migration if the 10.201.0.0 network goes down, for the obvious reason that Corosync still sees every node as healthy on its own link, so HA never fences anything. I also want to optimize the Ceph networking. Reinstalling, as the project requires, gives me the chance to get everything right and ensure I am following best practices.
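Whichever way it lands, once the cluster is rebuilt these are the standard commands I'd use to confirm which link Corosync is actually running on and what the HA stack sees:

# Corosync link/ring status per node (shows the address in use on each link)
corosync-cfgtool -s

# Cluster membership and quorum from the Proxmox side
pvecm status

# HA manager view: current master and state of HA-managed resources
ha-manager status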
My initial thoughts are:
- Leave vmbr0 configured as it is now with the 802.3ad bond to provide ingress and egress for the VMs via the 10.201.0.0 network
- Configure Corosync to run on the 10.201.0.0 network, since that's the most important network and the VMs must remain available (rough corosync/Ceph sketch after this list)
- Configure the Ceph public network on the 192.168.21.0 network and not define a separate cluster network
- Ignore the 1Gb network for now
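Roughly what those last bullets translate to in config terms, as a minimal sketch with placeholder node and cluster names and host addresses (the real files are generated and managed by pvecm and pveceph, so this is only to show the intent):

# /etc/pve/corosync.conf (excerpt) -- single Corosync link on 10.201.0.0/23
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.201.0.11
  }
  # ...one node block per member, each with ring0_addr on 10.201.0.x
}

totem {
  # cluster name below is a placeholder
  cluster_name: hci-prod
  version: 2
  link_mode: passive
  interface {
    linknumber: 0
  }
}

# /etc/pve/ceph.conf (excerpt) -- public network only, no cluster_network
[global]
  public_network = 192.168.21.0/24

Leaving cluster_network undefined means Ceph replication traffic shares 192.168.21.0/24 with client traffic, which is the intent here.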