I upgraded Ceph on my 21-node cluster from 14.2.15 to 14.2.20 and restarted all services except the OSDs. The cluster uses dual 40 Gb Ethernet. I was seeing about 1.8 GB/s during rebalancing, but now I am seeing less than 100 MB/s. CephFS write throughput has dropped to an embarrassing 61.5 MB/s with fio:
Jobs: 1 (f=1): [_(8),w(1),_(7)][100.0%][w=17.5MiB/s][w=35 IOPS][eta 00m:00s]
fio.write.out: (groupid=0, jobs=16): err= 0: pid=621448: Tue May 4 10:13:27 2021
write: IOPS=117, BW=58.6MiB/s (61.5MB/s)(4096MiB/69891msec); 0 zone resets
clat (msec): min=29, max=1326, avg=129.19, stdev=102.06
lat (msec): min=29, max=1326, avg=129.21, stdev=102.06
clat percentiles (msec):
| 1.00th=[ 37], 5.00th=[ 43], 10.00th=[ 50], 20.00th=[ 61],
| 30.00th=[ 75], 40.00th=[ 88], 50.00th=[ 100], 60.00th=[ 115],
| 70.00th=[ 136], 80.00th=[ 171], 90.00th=[ 249], 95.00th=[ 321],
| 99.00th=[ 493], 99.50th=[ 634], 99.90th=[ 1011], 99.95th=[ 1116],
| 99.99th=[ 1334]
bw ( KiB/s): min= 1021, max= 8192, per=6.68%, avg=4006.17, stdev=1464.78, samples=2072
iops : min= 1, max= 16, avg= 7.80, stdev= 2.87, samples=2072
lat (msec) : 50=10.88%, 100=39.48%, 250=39.70%, 500=8.96%, 750=0.65%
lat (msec) : 1000=0.23%
cpu : usr=0.03%, sys=0.06%, ctx=8312, majf=0, minf=130
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8192,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=58.6MiB/s (61.5MB/s), 58.6MiB/s-58.6MiB/s (61.5MB/s-61.5MB/s), io=4096MiB (4295MB), run=69891-69891msec
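As a sanity check on the fio summary above, here is a rough Python sketch that recomputes the headline figures. The variable names are mine and the values are copied from the output; the last calculation assumes every job keeps exactly one I/O in flight (depth=1) for the whole run, which is only approximately true because the jobs finish at different times.

```python
# Values copied from the fio summary above.
total_mib = 4096        # io=4096MiB
runtime_s = 69.891      # run=69891msec
total_ios = 8192        # issued rwts: total=0,8192,0,0 (writes)
jobs = 16               # jobs=16
iodepth = 1             # IO depths: 1=100.0%
avg_lat_s = 0.12921     # lat (msec): avg=129.21

# Throughput in MiB/s and in decimal MB/s.
bw_mib_s = total_mib / runtime_s
bw_mb_s = bw_mib_s * 1024 * 1024 / 1e6

# Achieved IOPS and the implied size of each write.
iops = total_ios / runtime_s
block_kib = total_mib * 1024 / total_ios

# Upper bound on IOPS if each job always has one I/O in flight
# and each I/O takes the average latency (an assumption).
latency_bound_iops = jobs * iodepth / avg_lat_s

print(f"bandwidth: {bw_mib_s:.1f} MiB/s ({bw_mb_s:.1f} MB/s)")
print(f"IOPS: {iops:.0f}, block size: {block_kib:.0f} KiB")
print(f"latency-bound IOPS estimate: {latency_bound_iops:.0f}")
```

The estimate from latency alone comes out close to the measured IOPS, which suggests the result is limited by per-write latency (about 129 ms on average) rather than by network bandwidth.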