CephFS vs VirtIO SCSI Write IOPS


Sep 1, 2021

I've been testing our Proxmox Ceph cluster and have noticed something interesting. I've been running fio benchmarks against a CephFS mount and within a VM using VirtIO SCSI.

CephFS on /mnt/pve/cephfs:

root@pve03:/mnt/pve/cephfs# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=2713399: Tue Jan 25 10:09:20 2022
  write: IOPS=28.5k, BW=111MiB/s (117MB/s)(33.8GiB/310610msec); 0 zone resets
    slat (nsec): min=1102, max=1509.1k, avg=4846.75, stdev=4642.53
    clat (nsec): min=440, max=8859.8M, avg=26917.20, stdev=6476482.94
     lat (usec): min=11, max=8859.9k, avg=31.76, stdev=6476.57
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   13], 10.00th=[   14], 20.00th=[   15],
     | 30.00th=[   16], 40.00th=[   17], 50.00th=[   19], 60.00th=[   20],
     | 70.00th=[   22], 80.00th=[   23], 90.00th=[   24], 95.00th=[   24],
     | 99.00th=[   31], 99.50th=[   34], 99.90th=[   47], 99.95th=[   58],
     | 99.99th=[  125]
   bw (  KiB/s): min=    8, max=255687, per=100.00%, avg=150895.07, stdev=53573.80, samples=469
   iops        : min=    2, max=63921, avg=37723.74, stdev=13393.43, samples=469
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=64.42%, 50=35.51%
  lat (usec)   : 100=0.06%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%, >=2000=0.01%
  cpu          : usr=16.42%, sys=24.24%, ctx=8862699, majf=0, minf=2870
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8859814,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=33.8GiB (36.3GB), run=310610-310610msec

You can see here that we're getting 28.5k write IOPS overall on our /mnt/pve/cephfs filesystem - great!

However, when running the same fio test within a VM, on a disk attached via VirtIO SCSI and mounted at /mnt/scsi, we're averaging only 381 write IOPS:

root@ubuntu-server:/mnt/scsi# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=68937: Tue Jan 25 10:45:47 2022
  write: IOPS=381, BW=1526KiB/s (1562kB/s)(1255MiB/842421msec); 0 zone resets
    slat (usec): min=3, max=4430, avg=17.96, stdev=15.98
    clat (nsec): min=1314, max=123222k, avg=911777.04, stdev=6864969.31
     lat (usec): min=39, max=123242, avg=929.73, stdev=6864.87
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   50], 10.00th=[   57], 20.00th=[   59],
     | 30.00th=[   61], 40.00th=[   63], 50.00th=[   65], 60.00th=[   68],
     | 70.00th=[   71], 80.00th=[   75], 90.00th=[   87], 95.00th=[  111],
     | 99.00th=[37487], 99.50th=[66323], 99.90th=[78119], 99.95th=[81265],
     | 99.99th=[85459]
   bw (  KiB/s): min= 1280, max=47280, per=100.00%, avg=4283.81, stdev=7691.70, samples=600
   iops        : min=  320, max=11820, avg=1070.95, stdev=1922.93, samples=600
  lat (usec)   : 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=4.99%
  lat (usec)   : 100=88.69%, 250=4.15%, 500=0.15%, 750=0.03%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.36%, 50=0.64%
  lat (msec)   : 100=0.89%, 250=0.01%
  cpu          : usr=0.72%, sys=1.28%, ctx=325435, majf=0, minf=49
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,321312,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1526KiB/s (1562kB/s), 1526KiB/s-1526KiB/s (1562kB/s-1562kB/s), io=1255MiB (1316MB), run=842421-842421msec

Disk stats (read/write):
  sda: ios=0/288012, merge=0/3854, ticks=0/8575607, in_queue=8004080, util=99.36%
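
As a cross-check on the two headline figures, fio's average IOPS can be reproduced from each run's own output by dividing the issued write count (`issued rwts`) by the total run time (`run=`). A minimal sketch of that arithmetic; `avg_iops` is a hypothetical helper name, not part of fio:

```shell
# Average IOPS = issued writes / run time, with run time given in milliseconds.
avg_iops() {
    awk -v ios="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", ios * 1000 / ms }'
}

avg_iops 8859814 310610   # CephFS run on the node -> 28524
avg_iops 321312 842421    # VirtIO SCSI run in the VM -> 381
```

Note that the VM run reports 842421 msec of run time against a requested `--runtime=300`, so the 381 IOPS average is taken over the whole span, which presumably includes the final flush triggered by `--end_fsync=1`.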

Now, this system doesn't have the best disks in it (Samsung 870 EVOs), and I know they're not well suited to Ceph; we're in the process of swapping them out for Samsung PM893s. But why would that make a difference between CephFS and a block device?

Please provide the VM config (qm config <VMID>).

Hi, config is below. Please note this VM is purely for testing Ceph performance, hence the multiple disks (scsi0, scsi1, virtio1).

boot: order=scsi0;ide2;net0
cores: 4
ide2: cephfs:iso/ubuntu-20.04.3-live-server-amd64.iso,media=cdrom
memory: 4096
meta: creation-qemu=6.1.0,ctime=1639144331
name: iotest
net0: virtio=6A:E3:F6:12:38:E3,bridge=vmbr0,tag=3103
numa: 0
ostype: l26
scsi0: ceph-cluster:vm-109-disk-0,discard=on,size=32G
scsi1: ceph-cluster:vm-109-disk-1,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=65557df6-4d98-49af-b019-efaa49af6a97
sockets: 1
virtio1: ceph-cluster:vm-109-disk-2,discard=on,size=32G
vmgenid: bc110a93-f080-4ef2-895a-b01e97438c37
I'd suggest running the benchmark again with the following parameters, to make sure the cache doesn't play too big a role in the tests:
fio --name=random-write --ioengine=posixaio --rw=write --bs=4k --numjobs=1 --size=20g --iodepth=1 --runtime=600 --time_based --sync=1 --direct=1

This will provide a baseline you can use for further testing.
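
To compare like with like, the same command can be run from each location in turn. A rough sketch, assuming fio is installed on both the node and the guest; `sync_write_baseline` and the `DRY_RUN` switch are hypothetical names introduced here, while the fio flags are exactly those suggested above:

```shell
# Run the suggested sync/direct write test from inside a target directory.
# With DRY_RUN=1 the command is only printed, not executed.
sync_write_baseline() {
    dir="$1"
    # Replace the function's arguments with the fio command line.
    set -- fio --name=random-write --ioengine=posixaio --rw=write --bs=4k \
        --numjobs=1 --size=20g --iodepth=1 --runtime=600 --time_based \
        --sync=1 --direct=1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cd $dir && $*"
    else
        ( cd "$dir" && "$@" )
    fi
}

# Example (paths taken from the runs above):
#   sync_write_baseline /mnt/pve/cephfs    # on the node
#   sync_write_baseline /mnt/scsi          # inside the VM
```

Running it once with `DRY_RUN=1` first is a cheap way to confirm the command line before committing to a ten-minute run.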
@mira thanks - that makes sense. Here are the results with the test run as you suggested.

Directly against SSD I'm getting 20.0MiB/s with 5,366 IOPS.

With CephFS I'm getting 420KiB/s with 105 IOPS (so a pretty significant drop).

With the SCSI mount in the VM I'm getting 144KiB/s with 36 IOPS (a huge drop, although much more in line with CephFS this time).

So, my only real question: does Ceph degrade overall performance? I thought it was capable of acting like a RAID array and boosting performance by distributing the load of multiple writes across our disks?
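
One way to read the three results above: with `--iodepth=1` and `--sync=1` only one write is in flight at a time, so IOPS is roughly the inverse of the average per-write latency. A small sketch of that arithmetic (`qd1_latency_ms` is a hypothetical helper name):

```shell
# Approximate per-write latency in milliseconds at queue depth 1: 1000 / IOPS.
qd1_latency_ms() {
    awk -v iops="$1" 'BEGIN { printf "%.2f\n", 1000 / iops }'
}

qd1_latency_ms 5366   # directly against the SSD -> 0.19
qd1_latency_ms 105    # CephFS                   -> 9.52
qd1_latency_ms 36     # SCSI mount in the VM     -> 27.78
```

On these numbers each synchronous write takes roughly 10 ms through CephFS and nearly 28 ms through the VM disk, versus about 0.2 ms on the bare SSD. Because a single job at queue depth 1 never has more than one write outstanding, this particular test measures latency rather than how well writes can be spread across disks.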

I think the drastic IO drop I'm seeing is purely a result of power loss protection not being available on my consumer SSDs. As I say, we've got enterprise drives on order.
Is the CephFS mounted in the guest?

Consumer SSDs are rather bad when it comes to lots of IOPS. And you have a lot of overhead from Ceph and from the Guest/QEMU/VirtIO.
You could try specifying the Cache Mode of the disk as `Writeback`, and maybe use the VirtIO SCSI Single controller together with the IO Thread option.
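
From the node's shell, those settings could be applied roughly as follows. This is a sketch under assumptions: VMID 109 is inferred from the disk names in the posted config, scsi1 is assumed to be the disk under test, and `apply_writeback_iothread` and the `QM` override are hypothetical names introduced here:

```shell
# Hypothetical values: VMID 109 is inferred from the disk names in the config,
# and scsi1 is assumed to be the disk being benchmarked.
VMID="${VMID:-109}"

apply_writeback_iothread() {
    qm_cmd="${QM:-qm}"   # set QM=echo to preview the commands without running them
    # Switch the controller to VirtIO SCSI single so the disk can use its own IO thread.
    $qm_cmd set "$VMID" --scsihw virtio-scsi-single
    # Re-specify the disk with writeback cache and the IO thread option enabled.
    $qm_cmd set "$VMID" --scsi1 "ceph-cluster:vm-109-disk-1,size=32G,ssd=1,cache=writeback,iothread=1"
}

# Preview:  QM=echo apply_writeback_iothread
# Apply:    apply_writeback_iothread
```

Hardware changes like these generally only take effect after the VM has been fully stopped and started again, so it is worth checking `qm config` afterwards before re-running the benchmark.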
The CephFS I was testing is mounted on the node itself.

I've just given the writeback mode a go and it didn't make much difference. At least I now have some benchmarks to work from, so when the new disks arrive I'll have a good baseline to compare against.
