Some thoughts...
  • According to your configuration, your Ceph public network is shared with the PVE interface and presumably VM access. Your Ceph networks should be isolated; Ceph is fairly sensitive to latency, so mixing in all of that other traffic will contend with high-priority storage traffic, and I doubt you have QoS dialed in to account for this.
  • I wouldn't use Linux bridges for the Ceph interfaces; use the physical interface alone. It seems like an unnecessary abstraction layer in this case.
  • Ensure MTU 9000 (jumbo frames) is supported by your hardware and is set on every interface along the path: physical interfaces, Linux bridges (if you must keep them), and switchports. You may need to reset or restart hardware after enabling jumbo frames. MTU 9000 is especially useful for SAN-style workloads and would benefit the Ceph public and cluster networks, but not so much the PVE management interface.
    • I would leave vmbr0 at MTU 1500 and use 9000 only for the Ceph networks.
  • With the way things seem intermittent or unstable, the issue has an L2 smell to it. It may be MTU, but it might also be ARP or STP.
I would remove vmbr1, combine your Ceph public/cluster networks, and put their IP on eno2np1 with MTU 9000. Verify that the switchports are configured to handle MTU 9000 (including trunk ports if they're involved). Add some 10G NICs in the future, then split the Ceph public/cluster networks up again so that each gets one or more dedicated 10G links.
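For illustration, a minimal sketch of what the relevant fragment of /etc/network/interfaces could look like after that change; the subnet and address below are placeholders, not your actual Ceph addressing:

Code:
# ceph public+cluster traffic directly on the physical NIC
auto eno2np1
iface eno2np1 inet static
        address 10.10.10.11/24
        mtu 9000
# vmbr0 (management/VM bridge) stays as it is at MTU 1500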
 
  • According to your configuration, your Ceph public network is shared with the PVE interface and presumably VM access. Your Ceph networks should be isolated; Ceph is fairly sensitive to latency, so mixing in all of that other traffic will contend with high-priority storage traffic, and I doubt you have QoS dialed in to account for this.
Precisely. This is what all the documentation says, which is why I gave isolating Ceph's traffic a try.
I wouldn't use Linux bridges for the Ceph interfaces; use the physical interface alone. It seems like an unnecessary abstraction layer in this case.
So SR-IOV enabled on the Ceph private interfaces?
Ensure MTU 9000 (jumbo frames) is supported by your hardware and is set on every interface along the path: physical interfaces, Linux bridges (if you must keep them), and switchports. You may need to reset or restart hardware after enabling jumbo frames. MTU 9000 is especially useful for SAN-style workloads and would benefit the Ceph public and cluster networks, but not so much the PVE management interface.
The hardware is four R740s and one R730, they are all using identical Mellanox ConnectX-4 LX Dual Port 25GbE SFP+ NICs.
 
The Linux bridge is vmbr1. Move the IP address from vmbr1 to eno2np1 and remove vmbr1. Linux bridges are very nice and performant, but they're unnecessary here and add complication; for instance, a bridge may be sitting at MTU 1500 even though the parent interface and the rest of the stack are set to 9000. You can check whether that is the case right now with ip link | grep vmbr1: | grep --color -E 'mtu\s[0-9]+'

If you're going to set MTU 9000, make sure you set it on the eno2np1 interface of each server as well as on the switch that the servers are connected to.
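If it helps, here is a quick way to apply and test the MTU on a node before making it persistent; the peer address is a placeholder, and the 8972-byte payload accounts for the 28 bytes of ICMP/IP headers:

Bash:
# set the MTU on the fly (non-persistent, just for testing)
ip link set dev eno2np1 mtu 9000
# verify jumbo frames pass end to end; -M do forbids fragmentation
ping -c 3 -M do -s 8972 10.10.10.12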
 
Hi fakebizprez
I understand the symptoms of your Ceph cluster are (the commands after this list show where these states turn up):
  • 'slow ops' on numerous OSDs across several nodes, and when restarting an OSD the error briefly disappears but comes back
  • 'slow metadata IO' reported by one or several MDSs
  • Due to the above, the affected CephFS becomes 'degraded'
  • After several days there may be 'mds behind on trimming' messages too, because that gets stuck as well
  • Increasing 'Degraded data redundancy' counts, while no OSDs have been removed
  • 'Reduced data availability' with PGs 'peering' or 'activating', stuck in that state and never completing to become 'active+clean'
  • While there are still many PGs which need 'recovery' or 'backfill' and are amber on the donut chart, the recovery and backfill traffic slows down and, no matter what you do, at some point it stops.
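All of these states show up in the standard health output, for reference:

Bash:
ceph -s               # overall status, degraded/peering PG counts, recovery rate
ceph health detail    # the individual 'slow ops' and trimming warnings
ceph pg dump_stuck    # PGs stuck in peering/activating/unclean states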
The above is what I experienced with my cluster (ceph version 18.2.4 reef (stable)), and this is how I believe I created the problem:
  • I added several new pools to the cluster with 256 PGs each. Some are used for CephFS, so they have additional metadata pools with 32 PGs each.
  • In total I ended up with 14 pools and a total of ~1,700 PGs.
  • Since I have a suboptimal combination of OSD sizes on the nodes, with very different capacities/disk sizes, I get an uneven distribution of PGs across the OSDs, and the larger OSDs of course attract more PGs. This makes things worse.
  • I removed some OSDs to eventually remove a node from the cluster, squeezing even more PGs onto fewer OSDs.
  • Ceph has 'tuning parameters' for allocating PGs to OSDs:
    • mon_max_pg_per_osd - in my case set to the default of 250 - the desired maximum number of PGs per OSD (what Ceph aims for)
    • osd_max_pg_per_osd_hard_ratio - in my case set to the default of 3 - the ratio by which the above value may be exceeded before Ceph stops creating new PGs
    • Which means that at 250 * 3 = 750 PGs on one OSD, Ceph stops creating new PGs on that OSD and things get stuck (the commands after this list show how to check both values and the per-OSD PG counts)
  • Since CRUSH dictates that a given PG must be on that particular OSD, Ceph cannot simply create it on another OSD, so things come to a halt.
  • My understanding of the documentation is that if you exceeded that limit by creating a new pool, it would throw a TOO_MANY_PGS warning. But when you get there by removing OSDs while already close to the limit, Ceph seems unable to detect that this is the underlying issue.
  • This has been discussed in the Ceph mailing list before.
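To check whether you are hitting the same wall, you can compare the two limits against the actual per-OSD PG counts (the PGS column):

Bash:
# current values of the two limits
ceph config get osd mon_max_pg_per_osd
ceph config get osd osd_max_pg_per_osd_hard_ratio
# how many PGs each OSD actually carries
ceph osd df tree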
Short term fix which I used:
  • Increase mon_max_pg_per_osd to a number high enough to allow PG creation on your most crowded OSD, in order to restore operational capability of the cluster.
Bash:
ceph config set osd mon_max_pg_per_osd 1000

This may have undesirable side effects, since now all OSDs aim for this PG count.

Short term fix which is probably better:
  • Increase osd_max_pg_per_osd_hard_ratio to a number high enough to allow PG creation on your most crowded OSD, in order to restore operational capability of the cluster.
Bash:
ceph config set osd osd_max_pg_per_osd_hard_ratio 5

This is probably the better quick fix, since it should only allow extra PG creation on the troublesome OSDs and not affect the others.

What I did as a long-term fix (maybe there are better suggestions):
Since the Ceph developers are clever and set the default value of mon_max_pg_per_osd to 250, I suspect that this is a good number of PGs per OSD and that we should aim for it.
  • Reduce the number of PGs per pool
    • Determining the appropriate number of PGs for a pool is an interesting subject which deserves its own thread; a rough rule-of-thumb calculation is sketched further below...
    • In my case 128 PGs per pool are no worse than 256.
Bash:
ceph osd pool set <pool name> pg_num <number of pgs>

Note that this will initially change pg_num_target and pgp_num_target, will require a lot of data to be moved around, and will take a while of 'backfilling' to complete. If you then think your backfills are too slow, I also wrote a post on improving that. You can check that the settings have been applied correctly:

Bash:
ceph osd pool ls detail | grep <pool name>
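As a rough starting point for sizing (the usual rule of thumb from the Ceph docs, with made-up example numbers, not values specific to this cluster): aim for on the order of 100 PGs per OSD in total, divide by the replication size, split across the pools, and round to a power of two.

Bash:
# hypothetical 20-OSD cluster with size=3 pools
echo $(( 20 * 100 / 3 ))    # ~666 PGs in total across all pools
# spread over e.g. 5 similarly sized pools -> ~133 each -> round to 128
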
  • Reduce the number of pools. Maybe some are redundant or can be combined.
  • Add Nodes and OSDs.
  • Once all is happy again, restore the default settings by deleting the custom settings from the config DB:
Code:
ceph config rm osd osd_max_pg_per_osd_hard_ratio
ceph config rm osd mon_max_pg_per_osd
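To confirm afterwards that no custom overrides remain (just a sanity check, not part of the recipe above):

Bash:
ceph config dump | grep pg_per_osd || echo "no custom pg_per_osd settings left"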

I hope that is applicable to your problem and helps somebody. Cheers
 