Ceph I/O issues on all SSD cluster

May 3, 2019

OK, where to start? I have been debugging intensively for the last two days, but I can't seem to wrap my head around the performance issues we see in one of our two hyperconverged Proxmox clusters.

Let me introduce our two clusters and some of the debugging results.

1. Proxmox for internal purposes (performs as expected)

3 x Supermicro servers with identical specs:
CPU: 1 x i7-7700K @ 4.20GHz (1 socket), 4 cores / 8 threads
RAM: 64 GB
OSDs: 4 per node, 1 per SSD (Intel S4610), 12 OSDs in all
NIC: 1 x 10GbE RJ45, MTU 9000, no bonding

A total of 3 servers with 12 OSDs.

Network:
1 x Unifi Switch 16 XG


2. Proxmox for customer VPSs (performs much worse than the internal cluster)
3 x Dell R630 with the following specs:
CPU: 2 x E5-2697 v3 @ 2.60GHz (2 sockets), 28 cores / 56 threads
RAM: 256 GB
OSDs: 10 per node, 1 per SSD (Intel S4610)
NIC: 1 x dual-port 10GbE SFP+, both ports bonded via LACP (bond-xmit-hash-policy layer3+4), MTU 9000

2 x Supermicro X11SRM-VF with the following specs:
CPU: 1 x W-2145 @ 3.70GHz (1 socket), 8 cores / 16 threads
RAM: 256 GB
OSDs: 8 per node, 1 per SSD (Intel S4610)
NIC: 1 x dual-port 10GbE SFP+, both ports bonded via LACP (bond-xmit-hash-policy layer3+4), MTU 9000

1 x Dell R630 with the following specs:
CPU: 2 x E5-2696 v4 @ 2.20GHz (2 sockets), 44 cores / 88 threads
RAM: 256 GB
OSDs: 8 per node, 1 per SSD (Intel S4610)
NIC: 1 x dual-port 10GbE SFP+, both ports bonded via LACP (bond-xmit-hash-policy layer3+4), MTU 9000

A total of 6 servers with 54 OSDs.


Network:
2 x Dell N4032F 10GbE SFP+ switches connected via MLAG. Each node is connected to both switches.
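
For reference, a bond like the one described above is typically defined in /etc/network/interfaces on each Proxmox node. A minimal sketch, assuming ifupdown2, with placeholder interface names and addresses:
Code:
auto bond0
iface bond0 inet manual
        bond-slaves enp4s0f0 enp4s0f1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer3+4
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 192.0.2.11/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        mtu 9000

Note that with layer3+4 hashing a single TCP connection still only uses one 10 Gbps member link; the bond only helps once many connections run in parallel.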

To get a fair comparison, I ran the following fio test on one host in each cluster, against an RBD block device that I created:

Cluster 1:
Code:
fio --randrepeat=1 --ioengine=libaio --sync=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=25.0MiB/s][w=6409 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=869177: Wed Nov 13 11:15:59 2019
  write: IOPS=4158, BW=16.2MiB/s (17.0MB/s)(4096MiB/252126msec); 0 zone resets
   bw (  KiB/s): min= 2075, max=32968, per=99.96%, avg=16627.60, stdev=9635.42, samples=504
   iops        : min=  518, max= 8242, avg=4156.88, stdev=2408.86, samples=504
  cpu          : usr=0.53%, sys=3.81%, ctx=109599, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=16.2MiB/s (17.0MB/s), 16.2MiB/s-16.2MiB/s (17.0MB/s-17.0MB/s), io=4096MiB (4295MB), run=252126-252126msec

Disk stats (read/write):
  rbd0: ios=46/1221898, merge=0/1870438, ticks=25/4654920, in_queue=1980016, util=84.70%

Cluster 2:
Code:
fio --randrepeat=1 --ioengine=libaio --sync=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][99.9%][w=7024KiB/s][w=1756 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=794096: Wed Nov 13 11:25:56 2019
  write: IOPS=1353, BW=5415KiB/s (5545kB/s)(4096MiB/774601msec); 0 zone resets
   bw (  KiB/s): min=   40, max=30600, per=100.00%, avg=5420.24, stdev=3710.17, samples=1547
   iops        : min=   10, max= 7650, avg=1355.06, stdev=927.54, samples=1547
  cpu          : usr=0.16%, sys=1.19%, ctx=100028, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=5415KiB/s (5545kB/s), 5415KiB/s-5415KiB/s (5545kB/s-5545kB/s), io=4096MiB (4295MB), run=774601-774601msec

Disk stats (read/write):
  rbd0: ios=0/1222639, merge=0/1784089, ticks=0/12124812, in_queue=9514280, util=45.14%

And identical rados bench tests:

Cluster 1: https://i.imgur.com/AdARCA6.png
Cluster 2: https://i.imgur.com/Di7mYQh.png
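
For anyone wanting to reproduce: a typical rados bench run looks something like this, with a placeholder pool name (the exact block size and thread count may differ from the runs in the screenshots):
Code:
# 60 s of 4 MiB writes with 16 concurrent ops, keeping the objects for the read tests
rados bench -p testpool 60 write -b 4M -t 16 --no-cleanup
# sequential and random read tests against the objects written above
rados bench -p testpool 60 seq -t 16
rados bench -p testpool 60 rand -t 16
# delete the benchmark objects again
rados -p testpool cleanup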

I have fio tested all the disks and I have tested the network, but I can't seem to find the reason why performance on cluster 2 is so poor compared to cluster 1.
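
For the per-disk tests, the number that usually matters for Ceph is each SSD's sync write latency on its own; a minimal sketch of that kind of test, with a placeholder device (it overwrites data on that device):
Code:
# 4k synchronous writes at queue depth 1 against a raw SSD -- destroys data on /dev/sdX
fio --name=sync-write --filename=/dev/sdX --ioengine=libaio --direct=1 --fsync=1 \
    --bs=4k --iodepth=1 --numjobs=1 --rw=write --time_based --runtime=60 --group_reporting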
 
So, you have 2x10Gbps SFP+ on the customer cluster for all traffic (Ceph public + cluster)? What is the network utilization on that bond?

Some hints (I have S4610 SSDs too, but 2x10Gbps + 2x10Gbps bonds):
1] your network latency is worse than ours; we see a max latency of 0.3 in rados bench
2] you have different CPUs with different clock speeds => every node performs differently
3] how are the OSDs connected to the servers? HBA? RAID controller?
 
So, you have 2x10Gbps SFP+ on the customer cluster for all traffic (Ceph public + cluster)? What is the network utilization on that bond?
Yes, that's true. How do you even configure separate public and cluster networks?
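
From what I can tell it is just two options in /etc/pve/ceph.conf; a minimal sketch with placeholder subnets:
Code:
[global]
        # front-side network used by clients, monitors and OSDs
        public_network = 192.0.2.0/24
        # dedicated back-side network for OSD replication and recovery traffic
        cluster_network = 198.51.100.0/24

The OSDs have to be restarted (and the second subnet has to exist on every OSD node) before the cluster network is actually used.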

This is the bond without stressing:
Code:
bond0  /  traffic statistics

                           rx         |       tx
--------------------------------------+------------------
  bytes                   208.33 MiB  |      249.62 MiB
--------------------------------------+------------------
          max           72.01 Mbit/s  |   119.79 Mbit/s
      average           37.99 Mbit/s  |    45.52 Mbit/s
          min           22.23 Mbit/s  |    17.00 Mbit/s
--------------------------------------+------------------
  packets                     125645  |          126865
--------------------------------------+------------------
          max               6027 p/s  |        6114 p/s
      average               2731 p/s  |        2757 p/s
          min               1863 p/s  |        1769 p/s
--------------------------------------+------------------
  time                    46 seconds

And while running rados bench:
Code:
bond0  /  traffic statistics

                           rx         |       tx
--------------------------------------+------------------
  bytes                    15.45 GiB  |       51.62 GiB
--------------------------------------+------------------
          max            2.79 Gbit/s  |     9.16 Gbit/s
      average            2.46 Gbit/s  |     8.21 Gbit/s
          min            2.06 Gbit/s  |     7.02 Gbit/s
--------------------------------------+------------------
  packets                    2934412  |         6620471
--------------------------------------+------------------
          max              60553 p/s  |      136228 p/s
      average              54340 p/s  |      122601 p/s
          min              44961 p/s  |      103951 p/s
--------------------------------------+------------------
  time                    54 seconds
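
For reference, per-interface numbers like the ones above can be captured with vnstat's live mode:
Code:
# sample live traffic on the bond; Ctrl-C prints the min/avg/max summary shown above
vnstat -l -i bond0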


Some hints (I have S4610 SSDs too, but 2x10Gbps + 2x10Gbps bonds):
1] your network latency is worse than ours; we see a max latency of 0.3 in rados bench
2] you have different CPUs with different clock speeds => every node performs differently
3] how are the OSDs connected to the servers? HBA? RAID controller?

1. You are right, my max is 0.4, so not much difference, and an average latency of 0.05, which seems OK?
2. We started out with just the 3 identical nodes, and the performance was pretty similar.
3. The OSDs are connected as passthrough via an HBA330 controller.

For the record, all OSDs have 0-1 ms latency.
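
For anyone comparing numbers: per-OSD latency and the network path can be checked with commands along these lines (the peer IP is a placeholder):
Code:
# commit/apply latency per OSD as seen by the cluster, in ms
ceph osd perf
# verify that 9000-byte frames actually pass end to end (8972 = 9000 - 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 5 192.0.2.12
# check the negotiated LACP state of the bond
cat /proc/net/bonding/bond0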
 
