Poor VM disk IOPS on hyperconverged PVE/CEPH-Cluster

Klaus SALM
May 15, 2019
We are building a new PVE cluster from some of our existing servers and are experiencing troubling disk read/write performance.

Our cluster design is as follows:
  • 3 Servers (will be upgraded to 7 nodes later)
    • Node1: 24 CPUs (2 sockets 6 core Intel Xeon), 128 GB RAM
    • Node2: 24 CPUs (2 sockets 6 core Intel Xeon), 128 GB RAM
    • Node3: 48 CPUs (2 sockets 12 core Intel Xeon), 256 GB RAM
  • OS installed on 256 GB SSDs
  • dedicated 1Gbit/s network interfaces for
    • "coro-sync"
    • "VM wan"
    • "node management"
  • dedicated 10Gbit/s network interfaces for
    • ceph (bond active-backup)
    • cephfs (bond active-backup)
CEPH has been configured as follows (a sketch of the device-class crush rule setup follows the list):
  • 2 x 2TB SSDs (crush rule "ssd", no RAID)
  • 2 x 2TB HDDs (crush rule "hdd", no RAID)
  • 1 OSD per disk
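For reference, a minimal sketch of how device-class crush rules like these can be created. The crush root "default", the failure domain "host", and the pool name "vm-ssd" are assumptions for illustration, not our exact commands:

# one replicated rule per device class, restricted to OSDs of that class
ceph osd crush rule create-replicated ssd default host ssd
ceph osd crush rule create-replicated hdd default host hdd
# a pool is then bound to one of the rules, e.g. a hypothetical pool "vm-ssd"
ceph osd pool set vm-ssd crush_rule ssd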
Before I provide the results of our fio tests, I would like to explain the test scenario:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --filename=/dev/sdb
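Note that --filename=/dev/sdb runs fio against the raw device and overwrites its contents. A non-destructive, file-based variant of the same workload (the test file path /mnt/fio-testfile is just an assumed example) looks like this:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --filename=/mnt/fio-testfile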

PVE MH = reference system on an Intel NUC with M.2 SSD
PVE-CEPH = PVE cluster as described above with VM disk on CEPH
PVE-Local = PVE cluster with VM disk on local SSD (non-CEPH)

[Attachment: 1587035294350.png — table of fio benchmark results]

Our interpretation of these values is that performance on the local disks of each host is much higher than inside a virtual machine.
The HBA and disks are capable of ~230 MB/s read and ~75 MB/s write (see row "native", column "PVE-CEPH").
However, when we create a disk inside a virtual machine, throughput drops to ~13 MB/s read and ~4 MB/s write at best (see the row with "iothread" and "writeback cache" enabled). Other options for the virtual disk show even worse IOPS.
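For context, the best-performing disk configuration was attached roughly like this. This is only a sketch: the VMID 100, the storage name "ceph-ssd", and the virtio-scsi-single controller are assumptions, not our exact values:

qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 ceph-ssd:vm-100-disk-0,cache=writeback,iothread=1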

To test CEPH outside of a virtual machine we tried the following:
ceph osd pool create scbench 100 100 replicated ssd
rados bench 60 write -b 4M -t 16 -p scbench --no-cleanup
# result: 150 MB/s
rados bench 60 seq -t 16 -p scbench
# result: 1,300 MB/s
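Since rados bench uses 4M sequential I/O rather than the 4k random pattern of the fio test, a closer comparison from a node directly against RBD (bypassing QEMU entirely) could look like the following. This assumes fio is built with the rbd engine; the throwaway image name "fio-test" and client name "admin" are placeholders:

rbd create scbench/fio-test --size 10G
fio --name=rbd-randrw --ioengine=rbd --clientname=admin --pool=scbench --rbdname=fio-test --bs=4k --iodepth=64 --rw=randrw --rwmixread=75 --runtime=60 --time_based
rbd rm scbench/fio-test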

We are quite puzzled as to why we have these issues with our cluster and would be grateful for any insight into why performance is rather poor inside our VMs while it is good on the nodes and the network themselves.