Hi,
I've been testing our Proxmox Ceph cluster and have noticed something interesting. I've been running fio benchmarks against a CephFS mount and within a VM using VirtIO SCSI.
CephFS on /mnt/pve/cephfs -
Code:
root@pve03:/mnt/pve/cephfs# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.25
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=2713399: Tue Jan 25 10:09:20 2022
write: IOPS=28.5k, BW=111MiB/s (117MB/s)(33.8GiB/310610msec); 0 zone resets
slat (nsec): min=1102, max=1509.1k, avg=4846.75, stdev=4642.53
clat (nsec): min=440, max=8859.8M, avg=26917.20, stdev=6476482.94
lat (usec): min=11, max=8859.9k, avg=31.76, stdev=6476.57
clat percentiles (usec):
| 1.00th=[ 12], 5.00th=[ 13], 10.00th=[ 14], 20.00th=[ 15],
| 30.00th=[ 16], 40.00th=[ 17], 50.00th=[ 19], 60.00th=[ 20],
| 70.00th=[ 22], 80.00th=[ 23], 90.00th=[ 24], 95.00th=[ 24],
| 99.00th=[ 31], 99.50th=[ 34], 99.90th=[ 47], 99.95th=[ 58],
| 99.99th=[ 125]
bw ( KiB/s): min= 8, max=255687, per=100.00%, avg=150895.07, stdev=53573.80, samples=469
iops : min= 2, max=63921, avg=37723.74, stdev=13393.43, samples=469
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=64.42%, 50=35.51%
lat (usec) : 100=0.06%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=16.42%, sys=24.24%, ctx=8862699, majf=0, minf=2870
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8859814,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=33.8GiB (36.3GB), run=310610-310610msec
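For reference, these are the commands I'm using to check what's behind that CephFS mount (kernel client vs FUSE, and which pools / replication settings it sits on), in case that's relevant to the comparison:
Code:
mount | grep ceph         # type "ceph" = kernel client, "fuse.ceph-fuse" = FUSE client
ceph fs status            # data/metadata pools backing CephFS
ceph osd pool ls detail   # replication size/min_size on those pools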
You can see here that we're getting 28.5k write IOPS overall on our /mnt/pve/cephfs filesystem - great! However, when running the same fio test within a VM that has a VirtIO SCSI mountpoint, we're averaging out at 381 write IOPS:
Code:
root@ubuntu-server:/mnt/scsi# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=68937: Tue Jan 25 10:45:47 2022
write: IOPS=381, BW=1526KiB/s (1562kB/s)(1255MiB/842421msec); 0 zone resets
slat (usec): min=3, max=4430, avg=17.96, stdev=15.98
clat (nsec): min=1314, max=123222k, avg=911777.04, stdev=6864969.31
lat (usec): min=39, max=123242, avg=929.73, stdev=6864.87
clat percentiles (usec):
| 1.00th=[ 35], 5.00th=[ 50], 10.00th=[ 57], 20.00th=[ 59],
| 30.00th=[ 61], 40.00th=[ 63], 50.00th=[ 65], 60.00th=[ 68],
| 70.00th=[ 71], 80.00th=[ 75], 90.00th=[ 87], 95.00th=[ 111],
| 99.00th=[37487], 99.50th=[66323], 99.90th=[78119], 99.95th=[81265],
| 99.99th=[85459]
bw ( KiB/s): min= 1280, max=47280, per=100.00%, avg=4283.81, stdev=7691.70, samples=600
iops : min= 320, max=11820, avg=1070.95, stdev=1922.93, samples=600
lat (usec) : 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=4.99%
lat (usec) : 100=88.69%, 250=4.15%, 500=0.15%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.36%, 50=0.64%
lat (msec) : 100=0.89%, 250=0.01%
cpu : usr=0.72%, sys=1.28%, ctx=325435, majf=0, minf=49
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,321312,0,1 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1526KiB/s (1562kB/s), 1526KiB/s-1526KiB/s (1562kB/s-1562kB/s), io=1255MiB (1316MB), run=842421-842421msec
Disk stats (read/write):
sda: ios=0/288012, merge=0/3854, ticks=0/8575607, in_queue=8004080, util=99.36%
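For completeness, this is how I'm pulling the disk settings for the test VM on the host, since cache mode, iothread and the SCSI controller type could all be relevant here (the VM ID is a placeholder):
Code:
qm config <vmid> | grep -E '^(scsihw|scsi|virtio)'   # controller type plus per-disk options (cache, iothread, aio, ...)
pvesm status                                         # list the configured storages, including the RBD pool backing the VM disks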
Now, this system doesn't have the best disks in it (Samsung 870 EVOs), and I know they're not the best with Ceph, but we're in the process of swapping them out for Samsung PM893s. But why would that make a difference between CephFS and a block device?
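In case it helps narrow down whether the gap is in the RBD path itself rather than the VM layer or the disks, my next step is to run the same 4k random-write pattern straight against an RBD image from the host, bypassing the guest entirely. This is just a sketch - the pool and image names are placeholders for wherever the VM disk actually lives, and the fio variant assumes fio was built with the rbd engine:
Code:
# 4k random writes at queue depth 1 against an RBD image, run from the host
rbd bench --io-type write --io-pattern rand --io-size 4096 --io-threads 1 --io-total 1G <pool>/<image>

# the same pattern via fio's rbd engine
fio --name=rbd-randwrite --ioengine=rbd --clientname=admin --pool=<pool> --rbdname=<image> \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=300 --time_based
If that also comes back in the hundreds of IOPS, at least I'll know the slowdown sits below the VirtIO SCSI layer.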