Could you describe your ceph environments? How many servers, how many switches, and so on. 10GbE?
I'm running small 3-node clusters, 18-24 OSDs per cluster (1.6TB Intel S3610 SSD or 3.2TB HGST NVMe).
Fast CPU frequency per node (10-12 cores, 3GHz Intel). Replication x3.
Debian Stretch/Luminous with Bluestore, and Jessie/Jewel with Filestore.
2x10GbE per Ceph node (Ceph public and private networks on the same link).
2x10GbE per Proxmox node (SAN + LAN on the same links, different VLANs).
Proxmox nodes also have fast CPUs (3GHz) to reduce latency.
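For reference, here's a minimal ceph.conf sketch for that kind of layout; the subnet and values are placeholders, not my real addressing:

    [global]
        # public and cluster (private) traffic share the same 2x10GbE link,
        # so both networks point at the same subnet
        public_network  = 10.10.10.0/24
        cluster_network = 10.10.10.0/24
        # replication x3; keep accepting I/O while 2 copies are available
        osd_pool_default_size     = 3
        osd_pool_default_min_size = 2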
I'm also using CephFS and RadosGW for sharing data between my VMs, on a dedicated cluster.
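Inside the VMs it's just a standard kernel CephFS mount against that dedicated cluster, something like this (monitor address, client name and paths here are examples only):

    # mount CephFS from the dedicated cluster inside a VM
    mount -t ceph 10.10.20.1:6789:/ /mnt/shared \
        -o name=vmclient,secretfile=/etc/ceph/vmclient.secret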
Small clusters because they're simpler to upgrade, and if I don't have enough storage for a specific VM, we simply move the disk with Proxmox.
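Moving a disk to another storage is a one-liner with qm (the VM ID, disk and storage name below are examples only):

    # move disk scsi0 of VM 100 to another storage pool
    qm move_disk 100 scsi0 other-ceph-pool
    # add --delete 1 if you don't want to keep the source image around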
I know 2 people who have triggered this bug... The corruption bug (well known and easily triggered: just rebalance a sharded volume to lose data) took years to be fixed.
Also, I don't know if this has changed, but resyncing a VM volume/file required scanning all blocks of the source file.