Diagnosing slow ceph performance

poxin

Having issues trying to figure out where the performance problem lies here.

I currently have 5 nodes, each containing 5 Samsung EVO SSDs (I know consumer drives are not the best, but I still wouldn't expect performance to be this low).

The Ceph public and cluster networks are using Mellanox ConnectX-3 NICs with the latest firmware, connected to a Mellanox SX1012 at 40Gb Ethernet.

Each node is the following:
24 x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (2 Sockets)
128 GB RAM
Proxmox v7.0-10

Using rados bench, it appears that writes just... stop? The current MB/s reading is 0 a lot of the time. I created a CT on a Ceph pool and tested the disk there with dd; not much better.
Code:
root@testct:~# dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 14.3277 s, 74.9 MB/s

Code:
:~# iperf -c 10.3.32.185 -P 2
------------------------------------------------------------
Client connecting to 10.3.32.185, TCP port 5001
TCP window size:  165 KByte (default)
------------------------------------------------------------
[  3] local 10.3.32.186 port 35052 connected with 10.3.32.185 port 5001
[  4] local 10.3.32.186 port 35054 connected with 10.3.32.185 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3] 0.0000-10.0002 sec  16.2 GBytes  13.9 Gbits/sec
[  4] 0.0000-10.0001 sec  17.9 GBytes  15.4 Gbits/sec
[SUM] 0.0000-10.0001 sec  34.1 GBytes  29.3 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) = 0.126/0.190/0.253/0.126 ms (tot/err) = 2/0

Code:
:~# rados bench -p test 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_0051_23356
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        71        55   219.947       220   0.0346255   0.0483225
    2      16        82        66   131.971        44   0.0364492   0.0927038
    3      16        82        66   87.9818         0           -   0.0927038
    4      16        82        66   65.9868         0           -   0.0927038
    5      16        82        66   52.7898         0           -   0.0927038
    6      16        82        66   43.9915         0           -   0.0927038
    7      16        82        66   37.7072         0           -   0.0927038
    8      16        82        66   32.9938         0           -   0.0927038
    9      16        82        66   29.3279         0           -   0.0927038
   10      16        82        66   26.3951         0           -   0.0927038
   11      16        82        66   23.9956         0           -   0.0927038
   12      16        82        66   21.9959         0           -   0.0927038
Total time run:         12.6548
Total writes made:      82
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     25.9189
Stddev Bandwidth:       63.6239
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 0
Average IOPS:           6
Stddev IOPS:            15.906
Max IOPS:               55
Min IOPS:               0
Average Latency(s):     2.46885
Stddev Latency(s):      4.86423
Max latency(s):         12.6533
Min latency(s):         0.0248786

Code:
:~# cat /etc/ceph/ceph.conf 
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.3.32.0/24
     fsid = 6e26c2db-7adf-401b-a944-26610d0c77e7
     mon_allow_pool_delete = true
     mon_host = 10.3.32.185 10.3.32.186 10.3.32.187 10.3.32.188 10.3.32.189
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.3.32.0/24

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.phy-hv-sl-0049]
     host = 0049-phy-hv-sl
     mds standby for name = pve

[mds.phy-hv-sl-0050]
     host = 0050-phy-hv-sl
     mds_standby_for_name = pve

[mds.phy-hv-sl-0051]
     host = 0051-phy-hv-sl
     mds_standby_for_name = pve

[mon.0047-phy-hv-sl]
     public_addr = 10.3.32.185

[mon.0048-phy-hv-sl]
     public_addr = 10.3.32.186

[mon.0049-phy-hv-sl]
     public_addr = 10.3.32.187

[mon.0050-phy-hv-sl]
     public_addr = 10.3.32.188

[mon.0051-phy-hv-sl]
     public_addr = 10.3.32.189
 
A few things that come to my mind:

Testing storage with dd is not a good idea. Rather, use fio. Check out the Ceph benchmark papers to get an idea of how to use it. dd from /dev/zero is definitely not a useful benchmark, because many storage layers will not write the zeros out fully but compress them or just store them sparsely.
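For reference, the single-disk 4K sync write test from the benchmark paper looks roughly like this (it writes directly to the device, so only point it at a disk with no data on it; /dev/sdX is a placeholder):
Code:
fio --ioengine=libaio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=fio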

Which EVO SSDs do you use exactly?

Can you show us the network config as well? (/etc/network/interfaces)

The same subnet is configured for the mandatory Ceph public network and the optional Ceph cluster network:
Code:
     cluster_network = 10.3.32.0/24
     public_network = 10.3.32.0/24

In this case, I'd rather remove the cluster network. A restart of the OSDs will be needed.
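A minimal sketch of that, assuming the config is the shared /etc/pve/ceph.conf of a standard Proxmox VE Ceph setup: delete the cluster_network line from the [global] section, then restart the OSDs one node at a time:
Code:
systemctl restart ceph-osd.target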

You can use ceph tell osd.X bench to quickly benchmark each OSD one by one. There might be some slow outliers that cost you overall performance.
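A quick way to run that across all OSDs (a small sketch, assuming a bash shell on a node with the admin keyring):
Code:
for id in $(ceph osd ls); do
    echo -n "osd.$id: "
    ceph tell osd.$id bench
done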
 
Here's the current network interface config on one of the nodes. I'd actually like the heavy-lifting parts of Ceph to go over the 10.3.32.0/24 network and use 10.3.34.0/24 for our front-end management/client network and heartbeat.

That would mean keeping the cluster network on 10.3.32.x and moving the public network to 10.3.34.x, correct?
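If so, the [global] section would presumably end up with something like this (just a sketch of what I mean, nothing applied yet):
Code:
     cluster_network = 10.3.32.0/24
     public_network = 10.3.34.0/24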

The disks are Samsung 860 EVOs in this test-bench setup.

Code:
auto lo
iface lo inet loopback

iface eno1 inet manual

iface eno1d1 inet manual
    mtu 9000

auto eno1d1.33
iface eno1d1.33 inet static
    address 10.3.32.185/24
    mtu 9000
#ceph

auto vmbr0
iface vmbr0 inet manual
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes

auto vmbr0.33
iface vmbr0.33 inet static
    address 10.3.34.185/24
    gateway 10.3.34.1
 
Fixed up the cluster and public networks, separating them. Both are on 40GbE Mellanox. Reran the tests with fio; these numbers are even worse.
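The fio job options are visible in the JSON output below; they correspond to roughly this command line (a reconstruction, not the verbatim command):
Code:
fio --ioengine=libaio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --name=fio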


Code:
fio: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.16
Starting 1 process
3;fio-3.16;fio;0;0;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;18084;301;75;60008;28;310;34.135444;7.520405;8102;224289;13234.564075;5480.641932;1.000000%=9371;5.000000%=10289;10.000000%=10551;20.000000%=10813;30.000000%=11075;40.000000%=11730;50.000000%=12255;60.000000%=12910;70.000000%=13697;80.000000%=14876;90.000000%=16580;95.000000%=19005;99.000000%=26083;99.500000%=29491;99.900000%=65798;99.950000%=83361;99.990000%=223346;0%=0;0%=0;0%=0;8137;224337;13269.596564;5480.968261;64;392;100.000000%;301.308333;43.739257;0.089990%;0.241638%;9042;0;12;100.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.0%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;3.21%;92.81%;3.85%;0.09%;0.04%;0.00%;0.00%;0.00%;0.00%;0.00%;sdb;51;4512;0;0;3;59755;50068;99.91%
{
  "fio version" : "fio-3.16",
  "timestamp" : 1629151049,
  "timestamp_ms" : 1629151049648,
  "time" : "Mon Aug 16 21:57:29 2021",
  "global options" : {
    "ioengine" : "libaio",
    "filename" : "/dev/sdb",
    "direct" : "1",
    "sync" : "1",
    "rw" : "write",
    "bs" : "4K",
    "numjobs" : "1",
    "iodepth" : "1",
    "runtime" : "60"
  },
  "jobs" : [
    {
      "jobname" : "fio",
      "groupid" : 0,
      "error" : 0,
      "eta" : 0,
      "elapsed" : 61,
      "job options" : {
        "name" : "fio"
      },
      "read" : {
        "io_bytes" : 0,
        "io_kbytes" : 0,
        "bw_bytes" : 0,
        "bw" : 0,
        "iops" : 0.000000,
        "runtime" : 0,
        "total_ios" : 0,
        "short_ios" : 0,
        "drop_ios" : 0,
        "slat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000
        },
        "clat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000,
          "percentile" : {
            "1.000000" : 0,
            "5.000000" : 0,
            "10.000000" : 0,
            "20.000000" : 0,
            "30.000000" : 0,
            "40.000000" : 0,
            "50.000000" : 0,
            "60.000000" : 0,
            "70.000000" : 0,
            "80.000000" : 0,
            "90.000000" : 0,
            "95.000000" : 0,
            "99.000000" : 0,
            "99.500000" : 0,
            "99.900000" : 0,
            "99.950000" : 0,
            "99.990000" : 0
          }
        },
        "lat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000
        },
        "bw_min" : 0,
        "bw_max" : 0,
        "bw_agg" : 0.000000,
        "bw_mean" : 0.000000,
        "bw_dev" : 0.000000,
        "bw_samples" : 0,
        "iops_min" : 0,
        "iops_max" : 0,
        "iops_mean" : 0.000000,
        "iops_stddev" : 0.000000,
        "iops_samples" : 0
      },
      "write" : {
        "io_bytes" : 18518016,
        "io_kbytes" : 18084,
        "bw_bytes" : 308592,
        "bw" : 301,
        "iops" : 75.339955,
        "runtime" : 60008,
        "total_ios" : 4521,
        "short_ios" : 0,
        "drop_ios" : 0,
        "slat_ns" : {
          "min" : 28049,
          "max" : 310829,
          "mean" : 34135.443928,
          "stddev" : 7520.405289
        },
        "clat_ns" : {
          "min" : 8102367,
          "max" : 224289665,
          "mean" : 13234564.074983,
          "stddev" : 5480641.931638,
          "percentile" : {
            "1.000000" : 9371648,
            "5.000000" : 10289152,
            "10.000000" : 10551296,
            "20.000000" : 10813440,
            "30.000000" : 11075584,
            "40.000000" : 11730944,
            "50.000000" : 12255232,
            "60.000000" : 12910592,
            "70.000000" : 13697024,
            "80.000000" : 14876672,
            "90.000000" : 16580608,
            "95.000000" : 19005440,
            "99.000000" : 26083328,
            "99.500000" : 29491200,
            "99.900000" : 65798144,
            "99.950000" : 83361792,
            "99.990000" : 223346688
          }
        },
        "lat_ns" : {
          "min" : 8137103,
          "max" : 224337253,
          "mean" : 13269596.564256,
          "stddev" : 5480968.260932
        },
        "bw_min" : 64,
        "bw_max" : 392,
        "bw_agg" : 100.000000,
        "bw_mean" : 301.308333,
        "bw_dev" : 43.739257,
        "bw_samples" : 120,
        "iops_min" : 16,
        "iops_max" : 98,
        "iops_mean" : 75.308333,
        "iops_stddev" : 10.928950,
        "iops_samples" : 120
      },
      "trim" : {
        "io_bytes" : 0,
        "io_kbytes" : 0,
        "bw_bytes" : 0,
        "bw" : 0,
        "iops" : 0.000000,
        "runtime" : 0,
        "total_ios" : 0,
        "short_ios" : 0,
        "drop_ios" : 0,
        "slat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000
        },
        "clat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000,
          "percentile" : {
            "1.000000" : 0,
            "5.000000" : 0,
            "10.000000" : 0,
            "20.000000" : 0,
            "30.000000" : 0,
            "40.000000" : 0,
            "50.000000" : 0,
            "60.000000" : 0,
            "70.000000" : 0,
            "80.000000" : 0,
            "90.000000" : 0,
            "95.000000" : 0,
            "99.000000" : 0,
            "99.500000" : 0,
            "99.900000" : 0,
            "99.950000" : 0,
            "99.990000" : 0
          }
        },
        "lat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000
        },
        "bw_min" : 0,
        "bw_max" : 0,
        "bw_agg" : 0.000000,
        "bw_mean" : 0.000000,
        "bw_dev" : 0.000000,
        "bw_samples" : 0,
        "iops_min" : 0,
        "iops_max" : 0,
        "iops_mean" : 0.000000,
        "iops_stddev" : 0.000000,
        "iops_samples" : 0
      },
      "sync" : {
        "lat_ns" : {
          "min" : 0,
          "max" : 0,
          "mean" : 0.000000,
          "stddev" : 0.000000,
          "percentile" : {
            "1.000000" : 0,
            "5.000000" : 0,
            "10.000000" : 0,
            "20.000000" : 0,
            "30.000000" : 0,
            "40.000000" : 0,
            "50.000000" : 0,
            "60.000000" : 0,
            "70.000000" : 0,
            "80.000000" : 0,
            "90.000000" : 0,
            "95.000000" : 0,
            "99.000000" : 0,
            "99.500000" : 0,
            "99.900000" : 0,
            "99.950000" : 0,
            "99.990000" : 0
          }
        },
        "total_ios" : 0
      },
      "job_runtime" : 60007,
      "usr_cpu" : 0.089990,
      "sys_cpu" : 0.241638,
      "ctx" : 9042,
      "majf" : 0,
      "minf" : 12,
      "iodepth_level" : {
        "1" : 100.000000,
        "2" : 0.000000,
        "4" : 0.000000,
        "8" : 0.000000,
        "16" : 0.000000,
        "32" : 0.000000,
        ">=64" : 0.000000
      },
      "iodepth_submit" : {
        "0" : 0.000000,
        "4" : 100.000000,
        "8" : 0.000000,
        "16" : 0.000000,
        "32" : 0.000000,
        "64" : 0.000000,
        ">=64" : 0.000000
      },
      "iodepth_complete" : {
        "0" : 0.000000,
        "4" : 100.000000,
        "8" : 0.000000,
        "16" : 0.000000,
        "32" : 0.000000,
        "64" : 0.000000,
        ">=64" : 0.000000
      },
      "latency_ns" : {
        "2" : 0.000000,
        "4" : 0.000000,
        "10" : 0.000000,
        "20" : 0.000000,
        "50" : 0.000000,
        "100" : 0.000000,
        "250" : 0.000000,
        "500" : 0.000000,
        "750" : 0.000000,
        "1000" : 0.000000
      },
      "latency_us" : {
        "2" : 0.000000,
        "4" : 0.000000,
        "10" : 0.000000,
        "20" : 0.000000,
        "50" : 0.000000,
        "100" : 0.000000,
        "250" : 0.000000,
        "500" : 0.000000,
        "750" : 0.000000,
        "1000" : 0.000000
      },
      "latency_ms" : {
        "2" : 0.000000,
        "4" : 0.000000,
        "10" : 3.207255,
        "20" : 92.811325,
        "50" : 3.848706,
        "100" : 0.088476,
        "250" : 0.044238,
        "500" : 0.000000,
        "750" : 0.000000,
        "1000" : 0.000000,
        "2000" : 0.000000,
        ">=2000" : 0.000000
      },
      "latency_depth" : 1,
      "latency_target" : 0,
      "latency_percentile" : 100.000000,
      "latency_window" : 0
    }
  ],
  "disk_util" : [
    {
      "name" : "sdb",
      "read_ios" : 51,
      "write_ios" : 4512,
      "read_merges" : 0,
      "write_merges" : 0,
      "read_ticks" : 3,
      "write_ticks" : 59755,
      "in_queue" : 50068,
      "util" : 99.913276
    }
  ]
}

fio: (groupid=0, jobs=1): err= 0: pid=23592: Mon Aug 16 21:57:29 2021
  write: IOPS=75, BW=301KiB/s (309kB/s)(17.7MiB/60008msec); 0 zone resets
    slat (usec): min=28, max=310, avg=34.14, stdev= 7.52
    clat (msec): min=8, max=224, avg=13.23, stdev= 5.48
     lat (msec): min=8, max=224, avg=13.27, stdev= 5.48
    clat percentiles (msec):
     |  1.00th=[   10],  5.00th=[   11], 10.00th=[   11], 20.00th=[   11],
     | 30.00th=[   12], 40.00th=[   12], 50.00th=[   13], 60.00th=[   13],
     | 70.00th=[   14], 80.00th=[   15], 90.00th=[   17], 95.00th=[   20],
     | 99.00th=[   27], 99.50th=[   30], 99.90th=[   66], 99.95th=[   84],
     | 99.99th=[  224]
   bw (  KiB/s): min=   64, max=  392, per=100.00%, avg=301.31, stdev=43.74, samples=120
   iops        : min=   16, max=   98, avg=75.31, stdev=10.93, samples=120
  lat (msec)   : 10=3.21%, 20=92.81%, 50=3.85%, 100=0.09%, 250=0.04%
  cpu          : usr=0.09%, sys=0.24%, ctx=9042, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4521,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=301KiB/s (309kB/s), 301KiB/s-301KiB/s (309kB/s-309kB/s), io=17.7MiB (18.5MB), run=60008-60008msec

Disk stats (read/write):
  sdb: ios=51/4512, merge=0/0, ticks=3/59755, in_queue=50068, util=99.91%

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.3.34.185/21
     fsid = 972b4d17-8d71-4b77-8ed9-8e44e2c84f16
     mon_allow_pool_delete = true
     mon_host = 10.3.32.185 10.3.32.187 10.3.32.189 10.3.32.186 10.3.32.188
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 10.3.32.185/24
 
No SMART errors, but these OSD bench results seem to be all over the place. 1-2% wearout on each drive; they're nearly brand new.

Code:
osd.0: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.2047925269999999,
    "bytes_per_sec": 487003566.48115581,
    "iops": 116.1106983378305
}
osd.1: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 3.0805641459999999,
    "bytes_per_sec": 348553632.7475,
    "iops": 83.101661860346795
}
osd.2: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 5.4624978860000004,
    "bytes_per_sec": 196566085.04177642,
    "iops": 46.865006695217232
}
osd.3: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 3.2401102019999999,
    "bytes_per_sec": 331390525.95717853,
    "iops": 79.009658326429971
}
osd.4: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 9.9026616930000007,
    "bytes_per_sec": 108429617.94393191,
    "iops": 25.85163544271753
}
osd.5: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.2762211240000001,
    "bytes_per_sec": 471721228.08223265,
    "iops": 112.46710493140999
}
osd.6: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.2555964479999999,
    "bytes_per_sec": 476034542.85985827,
    "iops": 113.49547931190926
}
osd.7: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.8104495329999999,
    "bytes_per_sec": 223210287.6527569,
    "iops": 53.21747962302134
}
osd.8: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 3.8966389270000001,
    "bytes_per_sec": 275555894.22463316,
    "iops": 65.69764476409749
}
osd.9: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.0852922520000003,
    "bytes_per_sec": 262831092.06552792,
    "iops": 62.663815513975123
}
osd.10: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 3.2478926010000002,
    "bytes_per_sec": 330596468.51296854,
    "iops": 78.820340278856406
}
osd.11: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.4772998919999996,
    "bytes_per_sec": 239819053.87185535,
    "iops": 57.177318065608823
}
osd.12: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 5.3780395460000001,
    "bytes_per_sec": 199653017.57563534,
    "iops": 47.600988763722263
}
osd.14: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.185408545,
    "bytes_per_sec": 491323156.23850548,
    "iops": 117.14056879007947
}
osd.15: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.1013513530000001,
    "bytes_per_sec": 261801960.27696922,
    "iops": 62.41845137523871
}
osd.16: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.1069355740000004,
    "bytes_per_sec": 261445986.83203009,
    "iops": 62.333580692298433
}
osd.17: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.0508964350000003,
    "bytes_per_sec": 265062768.50792903,
    "iops": 63.195888640386826
}
osd.18: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 3.3791237440000002,
    "bytes_per_sec": 317757473.63692874,
    "iops": 75.759285363418755
}
osd.19: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 4.7786535680000002,
    "bytes_per_sec": 224695473.04919845,
    "iops": 53.571575414943325
}
osd.20: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 23.645473035999999,
    "bytes_per_sec": 45410037.784621127,
    "iops": 10.826596685557634
}
osd.21: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.2305782999999999,
    "bytes_per_sec": 481373742.405725,
    "iops": 114.76844368117452
}
osd.22: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 5.0629263169999996,
    "bytes_per_sec": 212079291.06032062,
    "iops": 50.563643231468347
}
osd.24: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 2.5447438469999999,
    "bytes_per_sec": 421944953.42461872,
    "iops": 100.59951625457256
}
 
Maybe a silly question, but did you enable jumbo frames on your switch?
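A quick way to check that they actually pass end-to-end (and not just that the interface MTU is set) is a don't-fragment ping between two of the Ceph IPs; 8972 bytes of payload plus 28 bytes of IP/ICMP headers fills a 9000-byte MTU frame:
Code:
ping -M do -s 8972 -c 3 10.3.32.186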
 
Yup, they are enabled. iperf even shows the full 40Gbps between all nodes.
 
I have the same issue. This is a homelab. With only two SSDs in use, I get a write speed of 30 MB/s.
If the two SSDs are tested separately, each can reach a sequential write speed of 1500 MB/s.
Code:
root@pve1:~# rados bench 60 write -p rbd
 
Total time run:         61.3636
Total writes made:      473
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     30.8326
Stddev Bandwidth:       13.8827
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average IOPS:           7
Stddev IOPS:            3.47111
Max IOPS:               16
Min IOPS:               0
Average Latency(s):     2.07007
Stddev Latency(s):      1.22122
Max latency(s):         16.0428
Min latency(s):         0.860319
Cleaning up (deleting benchmark objects)
Removed 473 objects
Clean up completed and total clean up time :1.19631
root@pve1:~# rbd bench --io-type write --io-size 4M test
bench  type write io_size 4194304 io_threads 16 bytes 1073741824 pattern sequential
  SEC       OPS   OPS/SEC   BYTES/SEC
    1        48   72.0731   288 MiB/s
    2        80   36.4747   146 MiB/s
    3        96   30.9739   124 MiB/s
    4       112   28.1694   113 MiB/s
    5       128   26.4127   106 MiB/s
    6       144   17.3038    69 MiB/s
    7       160   17.1235    68 MiB/s
    8       176   16.7928    67 MiB/s
    9       192   16.4071    66 MiB/s
   10       208   16.6253    67 MiB/s
   11       224   15.8105    63 MiB/s
   12       240   15.4681    62 MiB/s
elapsed: 15   ops: 256   ops/sec: 16.7542   bytes/sec: 67 MiB/s
 
