PVE & distributed storage network design/segmentation/configuration advice

Roska

Hello,

I am planning to take advantage of the slow summer months and set up a lab environment using PVE with either GlusterFS or Ceph, to evaluate them as an alternative to the native Linux + OpenIO setup we currently run at the company I work for. I have done a fair bit of homework studying the different options and best practices; however, the optimal network design/segmentation/configuration for a cluster like this is something I haven't quite managed to wrap my head around yet.

The cluster will consist of the following hardware:

ToR:
Nexus 3064-X (supports MLAG)

Storage Server:
4x 10Gb NICs connected directly to the ToR with DAC cables
6-8 drives + cache SSD for distributed storage

Compute Server:
HP BL460s blades in C7000 chassis
2x 10Gb NICs on each blade
2x Virtual Connect modules, each with 2x 10Gb uplinks to the ToR, for a total of 4 uplinks per chassis.
No local storage for VMs

Both the compute and storage servers will be running PVE, and the latter will also run the distributed storage platform for the VMs in the cluster. Most of the VMs will run on the compute servers unless there is a special reason to move a workload to a storage server, such as a DB VM benefiting from local disk access, or a VM whose CPU or memory requirements call for a larger form factor server.

I am fully aware that having only two NICs on the compute servers and four uplinks per chassis to the ToR will severely limit my options when it comes to network design. Unfortunately it's not possible to change this at this stage, but I will most definitely fix the situation if this is ever put into production.

As my previous type 1 hypervisor experience is limited to vSphere/vSAN, I am quite confused about the following:
  1. How should I segment the traffic into different networks?
  2. How should I spread the different networks between the 2 NICs on the compute servers and the 4 NICs on the storage servers?
  3. What type of interface configuration (access port, trunk, MLAG) should I use for the NICs?
  4. Are there any additional steps (QoS?) I should take to prevent possible issues with the compute blades? All of the servers are next to each other and the ToR switches are the low-latency X model, so bandwidth should be the only concern.
  5. Given the high number of compute servers, I was planning on using OVS. Any thoughts on this vs. the Linux alternative?

Below is a list of the networks I planned on splitting the different types of traffic into, and how I planned to assign them to the NICs; a rough sketch of the compute-side interface config follows below the lists. Any feedback / real-world experience on this would be highly appreciated!

Networks:
VM private
VM DMZ
Storage front
Storage back
PVE management
PVE cluster/corosync network

Configuration on Compute:

NIC1: Trunk

Storage Front
VM private
VM DMZ
PVE management

NIC2: Access
PVE cluster/corosync network

Configuration on Storage:

NIC1:

Storage front

NIC2:
PVE cluster/corosync network

NIC3:
Storage back
PVE management

NIC4:
Storage back
VM Private
VM DMZ
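To make the compute side concrete, here is roughly what I had in mind for /etc/network/interfaces on a blade. This is only a sketch; interface names, VLAN IDs and addresses are placeholders I made up, and the vmbr0.<vid> notation assumes ifupdown2:

```
auto lo
iface lo inet loopback

# NIC1: trunk carrying storage front, VM private, VM DMZ and PVE management
auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet manual
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 10 20 30 40

# management address on its own VLAN (40 is just an example);
# a similar stanza would hold the storage-front address
auto vmbr0.40
iface vmbr0.40 inet static
    address 192.168.40.11/24
    gateway 192.168.40.1

# NIC2: access port dedicated to the corosync network, untagged
auto eno2
iface eno2 inet static
    address 192.168.50.11/24
```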
 
There is a lot to address here ;)

Ideally, you would have quite a few more NICs available, but this is a lab setup and nothing close to resembling production. Just don't extrapolate issues or subpar performance from this to a properly set up production environment :)

Do you have more than one Storage server or only one?

When trying out Ceph, you would want at least 3 storage nodes. The whole idea behind Ceph is to have it running on each node, so that when more storage and computing power is needed, you just add another node to the cluster. At a certain size it might become a good idea to separate Ceph storage nodes from compute nodes, with the latter only accessing Ceph as clients.
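Just as a pointer for when you get there: with Ceph on PVE the public/cluster split boils down to the two subnets you hand to the init step, roughly like this (the subnets are made-up examples; double check `pveceph init --help` on your version for the exact option names):

```
# public network = clients/monitors, cluster network = OSD replication traffic
pveceph install
pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24
# this ends up as public_network / cluster_network in the generated ceph.conf
```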

In a proper production network you would have dedicated (physical) networks for:
  • Corosync (it is very sensitive to latency)
  • Storage (the Ceph public and cluster networks, i.e. your storage front/back)

Then the following bonded in some way or another depending on the availability requirements:
  • Migration network (to transfer a VM's state quickly to another node in a live migration; see the datacenter.cfg sketch after this list)
  • Backup network, so the backup traffic doesn't interfere with anything else.
  • Production network, facing outwards.
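For the migration network, dedicating it is basically a one-liner in /etc/pve/datacenter.cfg, something along these lines (the CIDR is only an example):

```
# /etc/pve/datacenter.cfg - pin live-migration traffic to a dedicated subnet
migration: secure,network=10.10.30.0/24
```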

How should I segment the traffic into different networks?
Looks okay in my opinion. You did put corosync on its own NIC :) Depending on the storage you use, `Storage Front/Back` might be more or less easy to separate. Having storage on the same NIC as everything else might cause some performance problems though. But since you cannot add more NICs (one reason why blade servers aren't the best fit for a PVE Ceph cluster), you might be able to work around that by placing other services on the corosync NIC and using QoS to make sure the corosync packets get priority. Nothing I would recommend for production though ;)
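If corosync really has to share a NIC, a crude egress-side QoS sketch could look like the following. Treat it as an illustration only: eno1 is a placeholder, 5405 is the corosync default port, it does nothing for ingress, and the proper place for this would honestly be the switches:

```
# put everything into the lowest-priority band by default,
# then lift corosync (UDP 5405) into the top band 1:1
tc qdisc add dev eno1 root handle 1: prio bands 3 priomap 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
tc filter add dev eno1 parent 1: protocol ip u32 \
    match ip protocol 17 0xff \
    match ip dport 5405 0xffff \
    flowid 1:1
```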
How should I spread the different networks between the 2 NICs on the compute servers and the 4 NICs on the storage servers?
I think the previous answer is valid here?

What type of interface configuration (access port, trunk, MLAG) should I use for the NICs?
I guess you don't have enough NICs for MLAG? So I would say go with VLANs and be done with it? It's a lab setting anyway.

Are there any additional steps (QoS?) I should take to prevent possible issues with the compute blades? All of the servers are next to each other and the ToR switches are the low-latency X model, so bandwidth should be the only concern.
As mentioned earlier, due to the low number of NICs you might want to use QoS for corosync if you have other services running on the same NIC.

Given the high number of compute servers, I was planning on using OVS. Any thoughts on this vs. the Linux alternative?
Unless you need something very specific that the native Linux Bridge cannot deliver, stay away from yet another layer of complexity :)
For most setups a native bridge should suffice.

As a last hint, I would suggest you start small when getting to know PVE. The setup you describe here is quite advanced, with the workarounds needed and a fairly complicated network setup, so you might run into quite a few issues along the way.
 
Thank you for your reply Aaron!

It's indeed just a lab environment, but I have grown into the habit of checking anyway, as sometimes that's the difference between making something work and completely wasting your time. ;)

Do you have more than one Storage server or only one?

I have 10; I might add some more later on if my back ever recovers from putting these into the rack. One thing I learned from my adventure with OpenIO is that you can mitigate many issues by adding more nodes.

Looks okay in my opinion. You did put corosync on its own NIC :) Depending on the storage you use, `Storage Front/Back` might be more or less easy to separate. Having storage on the same NIC as everything else might cause some performance problems though. But since you cannot add more NICs (one reason why blade servers aren't the best fit for a PVE Ceph cluster), you might be able to work around that by placing other services on the corosync NIC and using QoS to make sure the corosync packets get priority. Nothing I would recommend for production though

I guess I should start by clarifying that by storage front/back I meant the public and cluster networks in "Ceph language", and whatever their counterparts are called in GlusterFS.

It's actually the type of blade and their density that are causing the problem here. I was hoping to use a slightly different model with a larger form factor and 4-12 onboard ports, but all I got was the tiny things with just 2 ports. Ah well... "it's just a lab"

I guess all the PDFs and the Kindle ebook purchase paid off when planning the segmentation!

I think the previous answer is valid here?

I am afraid not. The storage servers have two dual-port NICs each, for a total of 4 ports per server. Since that's also where most of the network load is going to be, I would very much appreciate any opinions/tips on how to set these up, as they will have all of the "problematic" networks (public, cluster and corosync) connected to them, unlike the compute servers.

I guess you don't have enough NICs for MLAG? So I would say go with VLANs and be done with it? It's a lab setting anyway.

As I pointed out above, I have 4 ports available on the storage servers and a lot of problematic traffic to push through those 4 ports.
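To make it concrete, one option I was toying with for the storage boxes is an LACP bond on two of the ports with a VLAN-aware bridge on top, roughly like this (untested sketch; interface names, VLAN IDs and addresses are placeholders):

```
# two of the four ports bonded towards the mlag pair on the Nexus
auto enp65s0f0
iface enp65s0f0 inet manual

auto enp65s0f1
iface enp65s0f1 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves enp65s0f0 enp65s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

# VLAN-aware bridge carrying storage back, PVE management and the VM VLANs
auto vmbr1
iface vmbr1 inet manual
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 20 30 40 50

# storage back (Ceph cluster network) address on its VLAN (needs ifupdown2)
auto vmbr1.20
iface vmbr1.20 inet static
    address 10.10.20.31/24
```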

Unless you need something very specific that the native Linux Bridge cannot deliver, stay away from yet another layer of complexity :)
For most setups a native bridge should suffice.

Since it's a lab I can deal with complicated, as I can safely break things here. I am more worried about management, as we are talking about an environment of 60 physical servers. OVS has some pretty nice integration APIs available. I honestly don't know what the alternatives are for the Linux bridge. Ansible?

As a last hint, I would suggest you start small when getting to know PVE. The setup you describe here is quite advanced, with the workarounds needed and a fairly complicated network setup, so you might run into quite a few issues along the way.

I hear you! Unfortunately, as this is a half work, half hobby type of project, I need to make it complicated enough that my findings are somewhat applicable to a production environment. Luckily it's just 60 servers and no L3 networking... for now. :)




 
