We've had Proxmox with Ceph for over 5 years now and have deployed a production cluster to move off of VMware and NetApp. We've got about 60T of NVMe in a dedicated pool, 15T of SSD, and 20T of HDD fronted by SSD, all configured in Ceph.
Overview:
5 dedicated storage nodes and 4 compute nodes, with bonded 25G Ethernet on the backend, all connected to Arista switching over fiber. The compute nodes have 2x10G on the front end.
Testing at the command line yields the expected high performance, with the expected differences between storage classes. We run into problems in high-IOPS/high-throughput scenarios when the source is a VM (KVM/QEMU). We get great initial performance when an intense IO operation starts, and then it gets clearly throttled. A database load that takes 29 minutes on VMware backed by old, slow NetApp storage takes seven hours on the NVMe pool.
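To see where in the stack the drop starts, this is roughly what we plan to watch during the next load. Untested sketch; the guest device is whatever the DB sits on, and I believe rbd perf image iostat needs the rbd_support mgr module enabled:
Code:
# On a Ceph node: live per-image client IO for the NVME pool
rbd perf image iostat NVME

# Inside the guest: device-level throughput and latency every 5 seconds
iostat -xm 5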
The performance drop-off occurs at roughly the same wall-clock point every time, which suggests we're hitting some programmatic limit. We've moved from librbd to KRBD in testing, and that helps because it's faster overall, but the drop-off still happens at the same time. It's almost as if QEMU is throttling, yet we don't have any disks in a throttle group and nothing is configured to throttle. We see this behavior on Windows guests as well. We're starting to suspect QEMU, so if there's something we need to tune or configure there, maybe we've missed it.
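Roughly how we've been double-checking that, in case we're verifying it the wrong way (VM ID 101 is a placeholder):
Code:
# Proxmox disk options for the VM -- looking for any mbps*/iops* limits and
# noting the cache/aio/iothread settings on each disk
qm config 101 | grep -E 'scsi|virtio|sata|ide'

# The live QEMU command line for that guest -- any throttling.* or bps/iops
# properties on the -drive arguments would show up here
ps -ww -o args= -p $(cat /var/run/qemu-server/101.pid) | tr ',' '\n' | grep -Ei 'throttl|iops|bps'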
Code:
rados write bench, default 16 threads
Total time run: 120.044
Total writes made: 38196
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1272.73
Stddev Bandwidth: 139.248
Max bandwidth (MB/sec): 1600
Min bandwidth (MB/sec): 868
Average IOPS: 318
Stddev IOPS: 34.8119
Max IOPS: 400
Min IOPS: 217
Average Latency(s): 0.0502733
Stddev Latency(s): 0.0332458
Max latency(s): 0.992938
Min latency(s): 0.0174327
rados seq bench default 16 threads
Total time run: 76.7042
Total reads made: 38196
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1991.86
Average IOPS: 497
Stddev IOPS: 15.5239
Max IOPS: 531
Min IOPS: 461
Average Latency(s): 0.031651
Max latency(s): 0.246385
Min latency(s): 0.00564461
rados rand bench default 16 threads
Total time run: 120.029
Total reads made: 60349
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 2011.15
Average IOPS: 502
Stddev IOPS: 12.747
Max IOPS: 531
Min IOPS: 474
Average Latency(s): 0.031409
Max latency(s): 0.217155
Min latency(s): 0.00256268
rbd bench-write localimage --pool=NVME
rbd: bench-write is deprecated, use rbd bench --io-type write ...
bench type write io_size 4096 io_threads 16 bytes 1073741824 pattern sequential
SEC OPS OPS/SEC BYTES/SEC
1 118096 118227 462 MiB/s
2 231504 115815 452 MiB/s
elapsed: 2 ops: 262144 ops/sec: 114972 bytes/sec: 449 MiB/s
All of those tests are on the NVMe pool. Any thoughts or suggestions on what to try are appreciated. We have great performance in most cases, but these are showstoppers.
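In case it helps frame suggestions, the next comparison we're planning is the same sustained write from inside a guest and directly against the pool, to see whether the drop-off only shows up with QEMU in the path. Untested sketch; the device and image names are placeholders, and the second run needs fio built with rbd support:
Code:
# Inside the VM, against a spare virtual disk, bypassing the guest page cache
fio --name=guest-write --filename=/dev/sdb --ioengine=libaio --direct=1 \
    --rw=write --bs=64k --iodepth=16 --runtime=1800 --time_based

# On a Proxmox/Ceph node, straight at an RBD image via librbd
fio --name=host-write --ioengine=rbd --clientname=admin --pool=NVME --rbdname=testimage \
    --direct=1 --rw=write --bs=64k --iodepth=16 --runtime=1800 --time_based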