CephFS vs VirtIO SCSI Write IOPS

chrispage1

Hi,

I've been testing our Proxmox Ceph cluster and have noticed something interesting. I've been running fio benchmarks against a CephFS mount and within a VM using VirtIO SCSI.

CephFS on /mnt/pve/cephfs -

Code:
root@pve03:/mnt/pve/cephfs# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.25
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=2713399: Tue Jan 25 10:09:20 2022
  write: IOPS=28.5k, BW=111MiB/s (117MB/s)(33.8GiB/310610msec); 0 zone resets
    slat (nsec): min=1102, max=1509.1k, avg=4846.75, stdev=4642.53
    clat (nsec): min=440, max=8859.8M, avg=26917.20, stdev=6476482.94
     lat (usec): min=11, max=8859.9k, avg=31.76, stdev=6476.57
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   13], 10.00th=[   14], 20.00th=[   15],
     | 30.00th=[   16], 40.00th=[   17], 50.00th=[   19], 60.00th=[   20],
     | 70.00th=[   22], 80.00th=[   23], 90.00th=[   24], 95.00th=[   24],
     | 99.00th=[   31], 99.50th=[   34], 99.90th=[   47], 99.95th=[   58],
     | 99.99th=[  125]
   bw (  KiB/s): min=    8, max=255687, per=100.00%, avg=150895.07, stdev=53573.80, samples=469
   iops        : min=    2, max=63921, avg=37723.74, stdev=13393.43, samples=469
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=64.42%, 50=35.51%
  lat (usec)   : 100=0.06%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%, >=2000=0.01%
  cpu          : usr=16.42%, sys=24.24%, ctx=8862699, majf=0, minf=2870
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8859814,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=33.8GiB (36.3GB), run=310610-310610msec

You can see here that we're getting 28.5k write IOPS overall on our /mnt/pve/cephfs filesystem - great!


However, when running the same fio test inside a VM, against a filesystem on a VirtIO SCSI disk, we average only 381 write IOPS:

Code:
root@ubuntu-server:/mnt/scsi# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=68937: Tue Jan 25 10:45:47 2022
  write: IOPS=381, BW=1526KiB/s (1562kB/s)(1255MiB/842421msec); 0 zone resets
    slat (usec): min=3, max=4430, avg=17.96, stdev=15.98
    clat (nsec): min=1314, max=123222k, avg=911777.04, stdev=6864969.31
     lat (usec): min=39, max=123242, avg=929.73, stdev=6864.87
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   50], 10.00th=[   57], 20.00th=[   59],
     | 30.00th=[   61], 40.00th=[   63], 50.00th=[   65], 60.00th=[   68],
     | 70.00th=[   71], 80.00th=[   75], 90.00th=[   87], 95.00th=[  111],
     | 99.00th=[37487], 99.50th=[66323], 99.90th=[78119], 99.95th=[81265],
     | 99.99th=[85459]
   bw (  KiB/s): min= 1280, max=47280, per=100.00%, avg=4283.81, stdev=7691.70, samples=600
   iops        : min=  320, max=11820, avg=1070.95, stdev=1922.93, samples=600
  lat (usec)   : 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=4.99%
  lat (usec)   : 100=88.69%, 250=4.15%, 500=0.15%, 750=0.03%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.36%, 50=0.64%
  lat (msec)   : 100=0.89%, 250=0.01%
  cpu          : usr=0.72%, sys=1.28%, ctx=325435, majf=0, minf=49
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,321312,0,1 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1526KiB/s (1562kB/s), 1526KiB/s-1526KiB/s (1562kB/s-1562kB/s), io=1255MiB (1316MB), run=842421-842421msec

Disk stats (read/write):
  sda: ios=0/288012, merge=0/3854, ticks=0/8575607, in_queue=8004080, util=99.36%
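
A side note on reading the two headline figures: the reported runtimes (310610 ms and 842421 ms) are longer than the requested `--runtime=300`, which suggests the headline IOPS also averages in the time spent in the final `--end_fsync=1` flush. The sketch below recomputes IOPS from the `issued rwts` and `run=` values in the two reports above. It assumes the write phase lasted exactly the requested 300 s and that the rest of the reported runtime is the final flush; that split is an inference from the output, not something fio states directly.

```python
def iops(issued_writes: int, runtime_ms: int) -> float:
    """Average write IOPS over a wall-clock window given in milliseconds."""
    return issued_writes / (runtime_ms / 1000)

# Figures copied from the two fio reports above.
runs = {
    "cephfs (host)": {"issued": 8_859_814, "total_ms": 310_610},
    "virtio-scsi (guest)": {"issued": 321_312, "total_ms": 842_421},
}
REQUESTED_MS = 300_000  # --runtime=300; assumed length of the write phase

for name, r in runs.items():
    overall = iops(r["issued"], r["total_ms"])        # what fio prints as IOPS=
    write_phase = iops(r["issued"], REQUESTED_MS)     # excluding the final flush
    flush_s = (r["total_ms"] - REQUESTED_MS) / 1000   # time attributed to fsync
    print(f"{name}: overall={overall:.0f} IOPS, "
          f"write phase={write_phase:.0f} IOPS, final flush~{flush_s:.0f}s")
```

Under that assumption the guest run works out to about 1071 IOPS while writing (which matches the `iops avg=1070.95` line) followed by roughly 542 s of flushing, whereas the CephFS run spends only about 11 s flushing. In other words, both runs were largely measuring writes into cache, and the 381 figure mostly reflects how long the cached data took to reach the disks.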

Now, this system doesn't have the best disks (Samsung 870 EVOs), and I know they're not ideal for Ceph; we're in the process of swapping them out for Samsung PM893s. But why would that make a difference between CephFS and a block device?
 
Please provide the VM config (qm config <VMID>).
 
Hi, the config is below. Please note this VM is purely for testing Ceph performance, hence the multiple disks (scsi0, scsi1, virtio1).

Code:
boot: order=scsi0;ide2;net0
cores: 4
ide2: cephfs:iso/ubuntu-20.04.3-live-server-amd64.iso,media=cdrom
memory: 4096
meta: creation-qemu=6.1.0,ctime=1639144331
name: iotest
net0: virtio=6A:E3:F6:12:38:E3,bridge=vmbr0,tag=3103
numa: 0
ostype: l26
scsi0: ceph-cluster:vm-109-disk-0,discard=on,size=32G
scsi1: ceph-cluster:vm-109-disk-1,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=65557df6-4d98-49af-b019-efaa49af6a97
sockets: 1
virtio1: ceph-cluster:vm-109-disk-2,discard=on,size=32G
vmgenid: bc110a93-f080-4ef2-895a-b01e97438c37
 
I'd suggest running the benchmark again with the following parameters, to make sure the cache doesn't play too big a role in the tests:
fio --name=random-write --ioengine=posixaio --rw=write --bs=4k --numjobs=1 --size=20g --iodepth=1 --runtime=600 --time_based --sync=1 --direct=1

This will provide a baseline you can use for further testing.
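
To see exactly what changes between the original command and the suggested one, here is a small sketch that parses both command lines (copied verbatim from this thread) and prints only the options that differ. The helper name `fio_options` is made up for this illustration.

```python
import shlex

def fio_options(command: str) -> dict:
    """Parse the '--key=value' arguments of a fio command line into a dict."""
    opts = {}
    for token in shlex.split(command)[1:]:  # skip the 'fio' binary itself
        key, _, value = token.lstrip("-").partition("=")
        opts[key] = value  # flags without a value (e.g. time_based) map to ''
    return opts

original = ("fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k "
            "--numjobs=1 --size=4g --iodepth=1 --runtime=300 --time_based --end_fsync=1")
suggested = ("fio --name=random-write --ioengine=posixaio --rw=write --bs=4k "
             "--numjobs=1 --size=20g --iodepth=1 --runtime=600 --time_based --sync=1 --direct=1")

before, after = fio_options(original), fio_options(suggested)
for key in sorted(before.keys() | after.keys()):
    if before.get(key) != after.get(key):
        print(f"{key}: {before.get(key, '(unset)')} -> {after.get(key, '(unset)')}")
```

Besides adding `sync=1` and `direct=1` (synchronous, unbuffered writes) and dropping `end_fsync`, the suggested command also switches `rw` from `randwrite` to `write`, so it measures sequential rather than random 4k writes. That is worth keeping in mind when comparing the new numbers with the first two runs.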
 
@mira thanks - that makes sense. Here are the results with the test run as you suggested.

Directly against SSD I'm getting 20.0MiB/s with 5,366 IOPS.

With CephFS I'm getting 420KiB/s with 105 IOPS (so a pretty significant drop).

With SCSI mount I'm getting 144KiB/s with 36 IOPS (a huge drop! However much more in line with CephFS this time).

So, my only real question: does Ceph degrade overall performance? I thought it was capable of acting like a RAID array and boosting performance by distributing the load of multiple writes across our disks?
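
A rough way to reason about this: with `--iodepth=1` and `--sync=1`, each write has to complete before the next one is issued, so IOPS is simply the reciprocal of per-write latency and there is nothing for Ceph to spread across disks in parallel. The sketch below converts the three measured IOPS figures into implied latency, then projects a ceiling for a deeper queue. The projection assumes latency stays constant as queue depth grows, which is an idealisation and not a measurement; `iodepth=16` is an arbitrary example value.

```python
def per_write_latency_ms(iops: float, iodepth: int = 1) -> float:
    """Average time one write is in flight (Little's law: depth = IOPS * latency)."""
    return iodepth / iops * 1000

def projected_iops(latency_ms: float, iodepth: int) -> float:
    """Idealised IOPS ceiling if latency did not rise with queue depth."""
    return iodepth / (latency_ms / 1000)

# IOPS figures from the sync/direct runs reported above.
measured = {"raw SSD": 5366, "CephFS": 105, "VirtIO SCSI disk": 36}

for name, iops in measured.items():
    lat = per_write_latency_ms(iops)
    ceiling = projected_iops(lat, 16)  # hypothetical queue depth
    print(f"{name}: {lat:.2f} ms per sync write, "
          f"idealised ceiling at iodepth=16: {ceiling:.0f} IOPS")
```

The implied latencies are about 0.19 ms for the raw SSD, 9.52 ms for CephFS and 27.78 ms for the VirtIO SCSI disk. A single-threaded, queue-depth-1 test therefore shows Ceph's per-write latency, not its aggregate throughput; any gain from distributing writes would only show up with more outstanding IO (higher `iodepth` or `numjobs`).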

I think the drastic IO drop I'm seeing is purely a result of my non-enterprise SSDs lacking power loss protection. As I said, we've got replacements on order.
 
Is the CephFS mounted in the guest?

Consumer SSDs are rather bad when it comes to lots of IOPS. And you have a lot of overhead from Ceph and from the Guest/QEMU/VirtIO.
You could try specifying the Cache Mode of the disk as `Writeback`, and maybe use the VirtIO SCSI Single controller together with the IO Thread option.
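
As an illustration of that suggestion, the sketch below only builds and prints the `qm set` command line; it does not run it. The VM ID 109 is inferred from the volume name `vm-109-disk-1` and `scsi1` is picked as the test disk, so treat both as assumptions and adjust them to your setup. The helper `qm_set_command` is made up for this example.

```python
import shlex

def qm_set_command(vmid: int, disk_key: str, volume: str, disk_opts: dict) -> str:
    """Build a 'qm set' command that switches the SCSI controller to
    VirtIO SCSI single and rewrites one disk line with the given options."""
    opts = ",".join(f"{k}={v}" for k, v in disk_opts.items())
    args = [
        "qm", "set", str(vmid),
        "--scsihw", "virtio-scsi-single",   # one controller per disk, needed for IO Thread
        f"--{disk_key}", f"{volume},{opts}",
    ]
    return shlex.join(args)

# Assumed values: VM 109 and disk scsi1, taken from the config posted above.
cmd = qm_set_command(
    109, "scsi1", "ceph-cluster:vm-109-disk-1",
    {"size": "32G", "ssd": 1, "cache": "writeback", "iothread": 1},
)
print(cmd)
```

The same settings can be changed in the GUI under the VM's Hardware tab; a controller change only takes effect after the VM has been fully stopped and started again.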
 
The CephFS I was testing is mounted on the node itself.

I've just given the writeback mode a go and it didn't make much difference. At least I now have some benchmarks to work from, so when the new disks arrive I've got a good baseline.

Thanks,
Chris.