It's a seven-node cluster spread across two buildings. Connectivity is mostly gigabit, with some 10G. Physically, it looks like this:
1G Switch A <-> 1G Switch B <-> 10G Switch C
node01 (192.168.1.201/24) - Switch A
node02 (192.168.1.202/24) - Switch A
node03 (192.168.1.203/24) - Switch B
node04 (192.168.1.204/24) - Switch B
node05 (192.168.1.205/24, 10.0.0.205/24) - Switch C
node06 (192.168.1.206/24, 10.0.0.206/24) - Switch C
node07 (192.168.1.207/24, 10.0.0.207/24) - Switch C
Latency is < 1ms everywhere. No packet loss.
Nodes 5/6/7 all have Ceph running (a pretty much completely stock Proxmox/Ceph setup, replicated 3x). Each has 4 enterprise-class SSDs, 3 of which are for Ceph, so 9 SSDs in Ceph in total. All three were built recently (within the last few months) on the then-current Proxmox/Ceph versions (5.4 and Luminous).
The logical network is two /24s: 192.168.1.0/24 for all seven nodes, plus 10.0.0.0/24 for the three Ceph nodes (5/6/7). All Ceph traffic happens on the 10.0.0.0/24 network. Nodes 1/2/3/4 do not have any 10.0.0.x addresses configured and so can never reach Ceph.
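For reference, this is roughly how the split could be sanity-checked from any node. It's just a quick sketch, not something from the cluster: the IPs are the ones above, 3300/6789 are the default msgr2/msgr1 ports, and it assumes the mons listen on the 10.0.0.x addresses.

```python
#!/usr/bin/env python3
"""Quick sketch: check which Ceph endpoints on 10.0.0.0/24 are reachable from
this node. Assumes the mons listen on the 10.0.0.x addresses of nodes 5/6/7
and use the default ports (3300 = msgr2, 6789 = msgr1)."""
import socket

CEPH_MON_IPS = ["10.0.0.205", "10.0.0.206", "10.0.0.207"]  # node05/06/07
PORTS = {3300: "msgr2", 6789: "msgr1"}

def can_connect(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

for ip in CEPH_MON_IPS:
    for port, proto in PORTS.items():
        status = "reachable" if can_connect(ip, port) else "UNREACHABLE"
        print(f"{ip}:{port} ({proto}): {status}")
```

Run from nodes 1-4 it should show everything unreachable; from nodes 5-7 everything should connect.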
All nodes are now running Proxmox 6, apart from node02, which is still on 5.4 and due to be upgraded at the end of this week. I followed the upgrade guides for 5 -> 6 and for Luminous -> Nautilus to the letter, and the upgrades on all nodes went pretty much without a hitch (props to the devs!). Ceph is now using msgr2.
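(For what it's worth, a quick way to confirm the mons are actually advertising msgr2 after the upgrade is to look for v2 addresses in the monitor map. Just a sketch wrapping the standard `ceph mon dump` command, run on one of the Ceph nodes:)

```python
#!/usr/bin/env python3
"""Rough check that the monitors advertise msgr2 (v2) addresses after the
Nautilus upgrade. Just wraps `ceph mon dump`; run on one of the Ceph nodes."""
import subprocess

def mon_dump():
    """Return the plain-text monitor map, which on Nautilus has lines like
    '0: [v2:10.0.0.205:3300/0,v1:10.0.0.205:6789/0] mon.node05'."""
    return subprocess.run(["ceph", "mon", "dump"],
                          capture_output=True, text=True, check=True).stdout

for line in mon_dump().splitlines():
    if "mon." in line:
        has_v2 = "v2:" in line
        print(("OK    " if has_v2 else "NO v2 ") + line.strip())
```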
Nodes 1 and 3 have ceph-msgr running at 100% (possibly node 4 as well, but nodes 1 and 4 are spares and were shut down after the upgrade). node03 is currently online, and ceph-msgr is thrashing its CPU right now. The network, hardware, VMs, config etc. did not change between 5.4 and 6, or between Luminous and Nautilus (apart from the upgrade itself). This problem did not occur on 5.4.
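To put a number on the symptom, here's a rough sketch that samples /proc twice and reports the CPU time burned in between by any task named ceph-msgr (nothing cluster-specific; it just assumes the busy task's name matches what top shows):

```python
#!/usr/bin/env python3
"""Rough sketch: sample /proc twice and report how much CPU time any task
named ceph-msgr accumulated in between."""
import os
import time

HZ = os.sysconf("SC_CLK_TCK")  # clock ticks per second (usually 100)

def ceph_msgr_cpu_ticks():
    """Map pid -> utime+stime (in ticks) for every task whose comm contains ceph-msgr."""
    ticks = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # task exited between listdir and open
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        if "ceph-msgr" in comm:
            fields = stat[stat.rindex(")") + 2:].split()
            ticks[int(pid)] = int(fields[11]) + int(fields[12])  # utime + stime
    return ticks

before = ceph_msgr_cpu_ticks()
time.sleep(5)
after = ceph_msgr_cpu_ticks()

for pid, t1 in after.items():
    delta = t1 - before.get(pid, t1)
    print(f"pid {pid}: ~{100.0 * delta / HZ / 5:.0f}% CPU over the last 5s")
```

On node03 that consistently shows the ceph-msgr task pinned, matching what top reports.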
I think this might be a Proxmox issue, because Ceph wasn't installed on node03 to begin with; I installed it (via the UI) after I saw the load, thinking it might help with the load issue. It didn't.
Thanks for looking into this.