I've been playing with a 4-node cluster built from some server nodes that used to be in a hyperconverged setup. I had everything running on the main network and performance was pretty slow. I then added a 1G switch for ceph_cluster and things roughly doubled. After that I added 4-port SFP+ NICs to all four nodes, with one cable from each node to a 10G switch, and performance improved more than tenfold.
After this I decided to try moving ceph_public onto the same 10G port as ceph_cluster, and performance dropped back to where it was when everything was on a 1G connection. Since then I've separated ceph_public and ceph_cluster onto different subnets connected to two different 10G switches, but performance has not recovered; if anything it has gotten worse.
I restarted all the OSDs, then restarted ceph.target on each node, then rebooted the nodes entirely; still no improvement after splitting onto the two 10G switches. I don't know what I'm missing here. We've also been using the storage pretty regularly, so I'd rather not scrap the whole thing and start over if I don't have to. Reading the forums, some people suggested raising the MTU from the default 1500 to 9000, but it's strange that a configuration that worked decently is now tanking on a higher-bandwidth network.
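For reference, this is roughly what I've been running on each node to sanity-check the MTU and which networks Ceph thinks it's using (the interface name is just a placeholder for my SFP+ port, and any MTU change would also need matching jumbo-frame settings on the switch ports):

    ip -br link show                                   # current MTU per interface
    ip link set dev enp3s0f0 mtu 9000                  # only if the 10G switches allow jumbo frames too
    grep -E 'public_network|cluster_network' /etc/ceph/ceph.conf
    ceph osd dump | grep 'osd\.'                       # shows the addresses each OSD actually registered on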
Current setup:
Main network delivered over a 1G switch to each node - 192.168.30.0/24
10G switch to one port on each node for ceph_public - 10.0.1.0/24
A separate 10G switch to one port on each node for ceph_cluster - 10.0.0.0/24
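In case it helps, the network section of my ceph.conf looks roughly like this (retyped from memory, trimmed to just the relevant options; the usual mon_host entries are left out):

    [global]
        public_network = 10.0.1.0/24
        cluster_network = 10.0.0.0/24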