First off, the Ceph Public network is mandatory. The Ceph Cluster network is optional and, if configured, will be used for the replication traffic between the OSDs, taking quite a bit of load off the Ceph Public network.
Both networks need to be fast and reliable, since clients (e.g. VMs) access the cluster via the Ceph Public network.
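For illustration, this is roughly what the two networks look like in the [global] section of /etc/pve/ceph.conf (the subnets are placeholders, adjust them to your actual networks):

```
[global]
    # Ceph Public network: monitors and clients (VMs) talk to the cluster here
    public_network = 10.10.10.0/24
    # Ceph Cluster network: OSD replication and heartbeat traffic
    cluster_network = 10.10.20.0/24
```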
The network used for the Proxmox VE cluster (Corosync) doesn't need a lot of bandwidth, but low latency is very important. Corosync can handle up to 8 networks on its own and will switch if one of them is deemed unusable.
It is best practice to give Corosync at least one dedicated physical network (1 Gbit is usually enough). This way, there won't be interference if another service takes up all the available bandwidth. Configuring additional networks is a good idea to give Corosync fallback options, should there be issues with the dedicated one.
See the docs (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy) on how to add additional networks to the Corosync config.
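Roughly sketched, a redundant setup ends up looking like this in /etc/pve/corosync.conf (node name, IPs and link numbers are just examples; follow the linked docs for the actual procedure, including incrementing config_version):

```
totem {
    ...
    interface {
        linknumber: 0
    }
    interface {
        linknumber: 1
    }
}

nodelist {
    node {
        name: pve1
        nodeid: 1
        quorum_votes: 1
        ring0_addr: 10.10.30.1    # link 0: dedicated Corosync network
        ring1_addr: 10.10.40.1    # link 1: fallback, may be shared with other traffic
    }
    ...
}
```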
Using only one Corosync network and sharing it with Ceph can quickly lead to issues. How bad they are depends on your setup. Do you use HA? Then the results can be quite catastrophic.
If Ceph starts using up all the available bandwidth, the latency for Corosync goes up, possibly to the point where Corosync considers the network unusable. If it cannot switch to another network, the Proxmox VE cluster connection is lost -> the node is not part of the quorum anymore -> /etc/pve/ is read-only.
That means any action that needs to write there will fail, for example changing configs, starting a guest, and so forth.
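You can see that state quickly on the affected node (a sketch; the exact error message may vary):

```
# is the node still quorate?
pvecm status | grep Quorate

# without quorum, any write to /etc/pve fails, e.g.:
touch /etc/pve/testfile
# -> denied until quorum is restored
```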
If the node has HA guests, the LRM will be in active mode, which increases the severity. The HA stack uses Corosync to determine whether a node is still part of the cluster. If the node has lost the connection for ~1 minute and the LRM is active (because HA guests are running on the node), it will fence itself (hard reset). It does that to make sure the HA guests are definitely powered down before the (hopefully) remaining cluster starts these guests on other nodes.
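To check whether the LRMs are idle or active (and therefore whether self-fencing can kick in at all), the HA status is a good starting point:

```
# shows quorum, the current HA master and each node's LRM state (idle/active)
ha-manager status
```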
If Ceph is the reason for the lost Corosync connection, it is likely that all nodes are affected. The result would be that the whole cluster (if all nodes had HA guests running) does a hard reset.
So, give Ceph good, reliable networks for both Cluster and Public, and give Corosync ideally its own physical network, plus additional networks it can switch to.