based on corosync limitations around bonding and latency
Corosync can run over a bond, but it is not recommended. Smaller clusters and blade servers regularly use bonded interfaces out of necessity.
Plus issues around multiple IPs per host.
Having multiple IP addresses per host is a standard deployment for a PVE cluster. In fact, it is recommended.
My understanding is that using one hostname for multiple IPs is problematic and that corosync has problems with bonded networks in some scenarios.
Each machine has only one hostname from PVE's perspective, but it can have many IP addresses. From a DNS perspective, you can give those additional addresses their own names. Yes, if you can, it is best not to run Corosync over a bonded connection.
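To make that concrete, here is a minimal /etc/hosts sketch for one node; the node name, domain, and addresses are made-up examples, and a typical PVE install maps the single hostname to the management address:

```
# /etc/hosts on node "pve1" (sketch; names and addresses are examples)
127.0.0.1       localhost
# The one hostname PVE cares about, mapped to the management address.
192.168.1.11    pve1.example.com pve1
# Other interfaces (for example a storage NIC on 10.200.200.11) need no entry
# here; if you want friendly names for them, add those in DNS only. They do
# not change the node's hostname.
```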
Keeping backup and production off the corosync network and switch will address potential latency issues.
Yes, it is recommended that Corosync's primary network be on its own network (and subnet) with its own switch. If you can, putting your backup Corosync links on their own networks is best as well; however, many deployments do not have the NICs for that, so they put the backup links on less ideal interfaces such as a shared network or a bonded interface.
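As a rough illustration of what the two links look like in practice, here is a trimmed sketch of a PVE-style /etc/pve/corosync.conf with a dedicated primary link and the management network as the fallback; the cluster name, node name, and addresses are made up:

```
# /etc/pve/corosync.conf (trimmed sketch; follow the PVE docs when editing this file)
totem {
  cluster_name: examplecluster
  config_version: 3
  interface {
    linknumber: 0   # primary: dedicated Corosync network and switch
  }
  interface {
    linknumber: 1   # backup: management network
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11    # dedicated Corosync subnet
    ring1_addr: 192.168.1.11   # management subnet as the backup link
  }
  # Remaining nodes follow the same pattern with their own addresses.
}

quorum {
  provider: corosync_votequorum
}
```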
Would this starting point eliminate most of the pitfalls of a basic cluster? If backups were interfering with production, they could be moved to their own network (or network throttling used).
You have the right idea to separate different types of traffic into other networks.
I think my main question would be, will two hostnames on a node cause any issues?
You only need one hostname. I'm not sure how you would even set up a second one. I believe you are mixing up the purpose of a hostname with the different networks.
In your example, you appear to have either two or three NICs in each host, but you left out vital information to make a well-informed recommendation.
- What speed are the NICs? 1, 10, 25 GbE or something else?
- What kind of traffic do you expect from your workloads? For example, are the VMs domain controllers, file servers, database servers, web servers, or something else?
- You list three switches but only two subnets. Are the switches capable of MLAG?
Here are some general guidelines:
- Management network
- Managing PVE hosts does not consume a significant amount of bandwidth. This is you connecting by SSH or using the web GUI.
- Occasional bursts may happen if you are uploading to your PVE hosts.
- May have migration traffic, see below.
- May have replication traffic, see below.
- May have corosync traffic, see below.
- May have guest traffic, see below.
- May have network storage traffic, see below.
- May have PBS traffic, see below.
- Migration network
- Migration traffic will occur on the management network, unless it is specifically assigned to a different network (see the datacenter.cfg sketch after this list).
- Migrating between hosts will depend on your storage configuration.
- If using local storage, this will consume bandwidth based on the size of the drives being migrated.
- If using shared storage, only the memory gets migrated. Typically, much smaller bursts.
- Replication network
- Replication traffic will occur on the migration network, unless it is specifically assigned to a different network.
- Corosync networks
- Corosync will run on the management network by default. It is strongly recommended to move it to a dedicated network.
- The primary should be on a dedicated interface and switch. Only needs 1 Gbps; more is not helpful.
- Should have a secondary link on a different interface. Ideally, it would be on a dedicated interface and switch. If that is not possible, many people use the management network as their secondary and accept that, if they have an issue with the primary Corosync link, they need to be careful until the primary has been restored.
- Guest traffic
- Will be on the management network unless you specifically set it up elsewhere.
- Guest traffic is highly variable between deployments. A domain controller will not use much bandwidth, but a busy file server with large engineering files might be a hog.
- On a typical small business three-node cluster, a VLAN trunk on a pair of bonded 1 Gbps interfaces (total 2 Gbps) can often handle the management network and the guest VLANs without issue as long as the other traffic is handled elsewhere.
- Larger and busier clusters typically have a pair of bonded 10 Gbps interfaces for the VLAN trunk that carries the guest traffic. It is not uncommon to run the management network here also. Again, the other traffic needs to have its own interface for this to work well.
- Proxmox Backup Server traffic
- Will also be on the management network (default gateway) unless you move it elsewhere. Not a configuration setting but a network decision.
- For example:
- If the PVE management network is on 192.168.1.0/24 and PBS is on 172.16.100.0/24, your backup traffic will leave from 192.168.1.0/24, go through your router, and arrive at 172.16.100.0/24.
- However, if you add an interface on 10.200.200.0/24 to both your PVE and PBS hosts, in addition to the interfaces above, and you configure PVE to use the PBS's 10.200.200.0/24 address, then instead of leaving 192.168.1.0/24 and going through the router, the backup traffic will leave the 10.200.200.0/24 interface and go directly to the PBS because they are on the same network (see the storage.cfg sketch after this list).
- Network storage traffic (for PVE hosts)
- It will also be on the management network unless configured to go elsewhere.
- See the PBS traffic example. If the PVE hosts and network storage have interfaces on the same network, the OS routing stack will prefer those and send the traffic there.
- Ceph network
- You did not ask about it, but it should always be on its own interfaces.
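To make the migration and replication items above concrete: moving that traffic off the management network is a one-line entry in /etc/pve/datacenter.cfg. A minimal sketch, assuming a dedicated 10.20.20.0/24 migration network (the subnet is an example); "secure" tunnels the migration over SSH and "network" selects which subnet the traffic uses:

```
migration: secure,network=10.20.20.0/24
```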
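And for the PBS example above, "configure PVE to use the PBS's 10.200.200.0/24 address" simply means entering that address as the server when you add the PBS storage, which ends up in /etc/pve/storage.cfg. A sketch of the resulting entry, with made-up storage ID, datastore name, and address:

```
pbs: pbs-backups
        datastore store1
        server 10.200.200.50
        content backup
        username backup@pbs
        fingerprint <PBS certificate fingerprint>
```

The only thing that matters for the routing is the server line: pointing it at the PBS address on the shared 10.200.200.0/24 subnet keeps the backup traffic off the router.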
So, in your example, if you only have two interfaces and they are both 1G, your setup is about all you can do. I would configure the migration and replication networks to be on the VM/backup interface so you do not choke Corosync.
If one of the interfaces is 10G and the other 1G, I would put everything on the 10G, except Corosync, which would go on the 1G by itself.
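As a sketch of that second layout (one 10G NIC carrying management and the guest VLAN trunk, one 1G NIC dedicated to Corosync), the node's /etc/network/interfaces might look roughly like this; interface names and addresses are examples:

```
auto lo
iface lo inet loopback

# 10 GbE NIC: management, guest VLAN trunk, migration, backups
iface enp65s0f0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11/24
        gateway 192.168.1.1
        bridge-ports enp65s0f0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

# 1 GbE NIC: dedicated Corosync link, no bridge, no gateway
auto eno1
iface eno1 inet static
        address 10.10.10.11/24
```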
You should consider installing more than two interfaces on each host if you plan to run this cluster in production. While shopping, look for switches with MLAG.
So as far as I can tell, PBS does not use the hostname for connecting to PVE and you only configure the IP address in PBS. From that perspective, it does not matter if the high-throughput NICs have a hostname or not.
PBS does not connect to PVE. PVE connects to PBS. When PVE connects to PBS, it uses the routing table to determine which interface to use.
Assuming your management interface has the default gateway and your PBS is on a different subnet, the traffic will be routed through the management network unless a better route is available. Above, I mentioned putting PVE and PBS on a shared network that is not the management network and configuring PVE to use the PBS IP address on that shared network. If you do that, PVE will use the directly connected network.
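If you want to check which path PVE will take, the routing table tells you. On a host with the extra 10.200.200.0/24 interface from the example above, the output of ip route would look roughly like this (illustrative, not copied from a real host):

```
$ ip route
default via 192.168.1.1 dev vmbr0
10.200.200.0/24 dev vmbr1 proto kernel scope link src 10.200.200.11
192.168.1.0/24 dev vmbr0 proto kernel scope link src 192.168.1.11

$ ip route get 10.200.200.50
10.200.200.50 dev vmbr1 src 10.200.200.11
```

The directly connected 10.200.200.0/24 route wins over the default route, so the backup traffic never touches the router.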
There is a lot here; hopefully, you find it helpful.
