bad ceph performance on SSD

kifeo

Active Member
Oct 28, 2019
Hello!

I have this Proxmox and Ceph setup:
Code:
root@proxmox1:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
 -1       18.19080 root default                             
 -7        0.90970     host proxmox1                         
  5   ssd  0.90970         osd.5         up  1.00000 1.00000
 -3        5.45695     host proxmox2                         
  0   hdd  1.81898         osd.0         up  1.00000 1.00000
  1   hdd  1.81898         osd.1         up  1.00000 1.00000
  2   hdd  1.81898         osd.2         up  1.00000 1.00000
-16        8.18535     host proxmox3                         
  3   hdd  1.81940         osd.3         up  1.00000 1.00000
 10   hdd  1.81898         osd.10        up  1.00000 1.00000
 11   hdd  1.81898         osd.11        up  1.00000 1.00000
 12   hdd  1.81898         osd.12        up  1.00000 1.00000
  9   ssd  0.90900         osd.9         up  1.00000 1.00000
-13        0.90970     host proxmox4                         
  4   ssd  0.90970         osd.4         up  1.00000 1.00000
-10        2.72910     host proxmox5                         
  6   hdd  1.81940         osd.6         up  1.00000 1.00000
  7   ssd  0.90970         osd.7         up  1.00000 1.00000


I noticed issues on my VMs that are on the ssdpool (the hddpool is for archive, non-speed-critical RBDs).
When I performed a benchmark, I was really disappointed:

Code:
root@proxmox1:~# rados bench 60 write -p ssdpool
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_proxmox1_3691
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        16         0         0         0           -           0
    2      16        23         7   13.9988        14    0.440324      1.1027
    3      16        23         7    9.3325         0           -      1.1027
    4      16        34        18   17.9984        22      2.5665     2.60108
    5      16        42        26   20.7981        32    0.246745     1.96686
    6      16        44        28   18.6649         8     4.71746     2.16382
    7      16        46        30   17.1412         8    0.223545     2.03352
    8      16        46        30   14.9984         0           -     2.03352
    9      16        46        30   13.3317         0           -     2.03352
   10      16        46        30   11.9985         0           -     2.03352
   11      16        46        30   10.9077         0           -     2.03352
   12      16        46        30   9.99878         0           -     2.03352
   13      16        46        30   9.22959         0           -     2.03352
   14      16        46        30   8.57028         0           -     2.03352
   15      16        46        30   7.99895         0           -     2.03352
   16      16        46        30   7.49901         0           -     2.03352
   17      16        46        30   7.05791         0           -     2.03352


whereas with the hddpool (spinning disks only):
Code:
root@proxmox1:~# rados bench 60 write -p hddpool
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_proxmox1_3076
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        28        12    47.999        48    0.940427    0.739306
    2      16        51        35   69.9969        92    0.722352    0.742403
    3      16        71        55   73.3288        80    0.489085    0.680408
    4      16        93        77   76.9947        88    0.325764    0.600826
    5      16       103        87   69.5933        40    0.277885    0.565222
    6      16       104        88    58.661         4    0.180992    0.560855
    7      16       104        88   50.2806         0           -    0.560855
    8      16       129       113   56.4942        50    0.821083     1.06378
    9      16       153       137   60.8822        96     0.45285     0.99641
   10      16       171       155   61.9932        72    0.420413    0.936436
   11      16       188       172   62.5386        68    0.321447    0.876192
   12      16       202       186   61.9933        56     3.64751    0.889077
   13      16       225       209   64.3008        92     3.92119    0.937779
   14      16       247       231   65.9925        88    0.400445    0.936914
   15      16       273       257   68.5256       104    0.405007    0.913546
   16      16       296       280   69.9923        92    0.477598    0.889124
   17      16       318       302   71.0511        88     1.10232    0.881567
   18      16       336       320   71.1034        72    0.520223    0.862996


Could you help me find the issue and resolve it?

Thanks
 
Please give us a little more insight into how you set up the cluster and what hardware you used.
 
proxmox1 is a NUC8i3, proxmox2 and proxmox3 are N54L microservers, and proxmox4 and proxmox5 are Dell 8200 SFF i7 machines.

The SSDs are all 1 TB.
The PCs are all connected to the same Gigabit switch and are up to date.

What other info would you like?
 
Just to say it upfront: there is not much to be expected from that hardware.

The PCs are all connected to the same Gigabit switch and are up to date.
The hddpool already uses most of the bandwidth and there is not much room for faster SSDs.
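As a rough estimate of what that means (assuming a single shared 1 GbE network for public and cluster traffic, and the usual size=3 replication):

Code:
1 GbE link            ~ 125 MB/s raw, roughly 110 MB/s of usable payload per host
hddpool rados bench   ~ 60-100 MB/s bursts from a single client node already
size=3 replication    -> each client write is additionally forwarded to the replica
                         OSDs over the same shared links
=> the per-host Gigabit ports are close to their limit before faster disks could help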

The SSDs are all 1 TB.
Depending on the model, the SSDs might just perform like the spinners.
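One quick way to check whether the SSDs themselves are the limit is a single-threaded 4k sync-write test directly on the device, which is roughly the write pattern of an OSD journal/WAL. A sketch with fio (the device path is a placeholder, and the test overwrites data, so only point it at an empty/spare disk):

Code:
# 4k sync writes at queue depth 1 - similar to Ceph journal/WAL writes
# WARNING: this writes to the raw device and destroys whatever is on it
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

Consumer SSDs without power-loss protection often manage only a few hundred IOPS in this test, which is why they can end up no faster than spinners under Ceph.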
 
I don't understand why the HDDs perform so much faster than the SSDs; the hddpool is barely used.
I would expect roughly the same speed from both types of disk.
 
SanDisk SDSSDP06
SanDisk SDSSDP12
Samsung SSD 850
and two NVMe drives on PCIe ports
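For reference, this kind of check lists the drive models and whether the kernel sees them as rotational (smartctl comes from the smartmontools package; /dev/sdX is a placeholder):

Code:
lsblk -d -o NAME,MODEL,SIZE,ROTA   # ROTA=0 means the kernel treats the disk as an SSD
smartctl -i /dev/sdX               # model, firmware and interface details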
 
I don't understand why the HDDs perform so much faster than the SSDs; the hddpool is barely used.
Check your network saturation. Replication and client traffic share the 1 GbE link (if the public and cluster networks are not separated).
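To rule the network in or out, it can also help to measure the raw throughput between two nodes while the cluster is otherwise idle, e.g. with iperf3 (needs to be installed on both ends; <node-ip> is a placeholder):

Code:
# on one node
iperf3 -s
# on another node
iperf3 -c <node-ip> -t 30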
 
Thanks for the reply. However, if the network were the cause, shouldn't it affect both pools during the benchmark?
 
Thanks for the reply. However, if the network were the cause, shouldn't it affect both pools during the benchmark?
I meant before as well as during the benchmark. Network congestion can lead to those drops.

And do the pools use different CRUSH rules?
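Which rule each pool actually uses can be checked with, for example:

Code:
ceph osd pool get ssdpool crush_rule
ceph osd pool get hddpool crush_rule
ceph osd crush rule dump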
 
I tried it multiple times and on different days, and I always get the same results.

Yes, the CRUSH rules are different; the only difference is the hdd/ssd device class:

Code:
rule replicated_hdd {
    id 1
    type replicated
    min_size 2
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 2
    type replicated
    min_size 2
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
 
Yes, the CRUSH rules are different; the only difference is the hdd/ssd device class.
Can you please post a ceph osd dump? A ceph df detail and ceph osd df tree would be nice as well.

I tried it multiple times and on different days, and I always get the same results.
What is the saturation of your network before and during the benchmark?
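For example, leaving a per-interface counter running on the benchmarking node shows this directly (sar is part of the sysstat package; the interface name is just an example):

Code:
sar -n DEV 1          # per-second RX/TX throughput for every interface
ip -s link show eno1  # raw per-interface counters if sysstat is not installed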
 
The network is not saturated: the first spike is the hddpool benchmark (around 100M), the second is the ssdpool benchmark (around 50M); the maximum seen on the interface is 400Mb, during VM migrations.
[screenshot: network throughput graph showing the two benchmark spikes]


ceph osd dump
Code:
osd.0 up   in  weight 1 up_from 31272 up_thru 32880 down_at 31271 last_clean_interval [30448,31269) [v2:192.168.1.10:6816/1037466,v1:192.168.1.10:6817/1037466] [v2:192.168.1.10:6818/1037466,v1:192.168.1.10:6819/1037466] exists,up 5fc96b0a-3c3b-405d-aed1-15a445f8400d
osd.1 up   in  weight 1 up_from 31268 up_thru 32880 down_at 31267 last_clean_interval [30450,31266) [v2:192.168.1.10:6800/1036685,v1:192.168.1.10:6801/1036685] [v2:192.168.1.10:6803/1036685,v1:192.168.1.10:6805/1036685] exists,up 843192e0-6665-4d81-b655-aefaafbf2cd3
osd.2 up   in  weight 1 up_from 31265 up_thru 32881 down_at 31264 last_clean_interval [30450,31263) [v2:192.168.1.10:6802/1035783,v1:192.168.1.10:6804/1035783] [v2:192.168.1.10:6806/1035783,v1:192.168.1.10:6808/1035783] exists,up d6e9f6c8-6347-427a-97c6-69b1dcdfaec0
osd.3 up   in  weight 1 up_from 31262 up_thru 32881 down_at 31261 last_clean_interval [30454,31260) [v2:192.168.1.9:6816/3884208,v1:192.168.1.9:6817/3884208] [v2:192.168.1.9:6818/3884208,v1:192.168.1.9:6819/3884208] exists,up 76b29428-d98e-47ce-ad72-764fc15d17e4
osd.4 up   in  weight 1 up_from 32867 up_thru 32881 down_at 32861 last_clean_interval [32685,32865) [v2:192.168.1.5:6800/2415482,v1:192.168.1.5:6801/2415482] [v2:192.168.1.5:6808/5415482,v1:192.168.1.5:6809/5415482] exists,up 050bb534-28c0-41fe-9240-7d77fa87aa65
osd.5 up   in  weight 1 up_from 32881 up_thru 32881 down_at 32880 last_clean_interval [31755,32880) [v2:192.168.1.199:6802/968,v1:192.168.1.199:6803/968] [v2:192.168.1.199:6810/17000968,v1:192.168.1.199:6811/17000968] exists,up a2828e4b-7679-4977-a4eb-bdebc66b324f
osd.6 up   in  weight 1 up_from 32347 up_thru 32347 down_at 32345 last_clean_interval [31244,32346) [v2:192.168.1.7:6807/2773093,v1:192.168.1.7:6810/2773093] [v2:192.168.1.7:6811/25773093,v1:192.168.1.7:6813/25773093] exists,up 8b6da74b-bcaa-4a99-929c-351f8f9202ec
osd.7 up   in  weight 1 up_from 32654 up_thru 32881 down_at 32652 last_clean_interval [31241,32653) [v2:192.168.1.7:6800/2772635,v1:192.168.1.7:6801/2772635] [v2:192.168.1.7:6802/61772635,v1:192.168.1.7:6803/61772635] exists,up 1f2ab5e6-3ef1-45f8-8c3f-246255a97fc3
osd.9 up   in  weight 1 up_from 31250 up_thru 32881 down_at 31249 last_clean_interval [30447,31248) [v2:192.168.1.9:6824/3880644,v1:192.168.1.9:6825/3880644] [v2:192.168.1.9:6826/3880644,v1:192.168.1.9:6828/3880644] exists,up 19f81414-7b43-4ad3-81f2-ac691101c9cb
osd.10 up   in  weight 1 up_from 31259 up_thru 32881 down_at 31258 last_clean_interval [30450,31257) [v2:192.168.1.9:6800/3883569,v1:192.168.1.9:6801/3883569] [v2:192.168.1.9:6802/3883569,v1:192.168.1.9:6803/3883569] exists,up 55a0f60c-ae89-45e9-9c6e-857d240b98c0
osd.11 up   in  weight 1 up_from 31256 up_thru 32881 down_at 31255 last_clean_interval [30447,31251) [v2:192.168.1.9:6808/3881805,v1:192.168.1.9:6809/3881805] [v2:192.168.1.9:6810/3881805,v1:192.168.1.9:6811/3881805] exists,up 96a6d843-51df-430e-ab0d-fa1e487ad069
osd.12 up   in  weight 1 up_from 31250 up_thru 32347 down_at 31249 last_clean_interval [30450,31248) [v2:192.168.1.9:6827/3880421,v1:192.168.1.9:6829/3880421] [v2:192.168.1.9:6831/3880421,v1:192.168.1.9:6833/3880421] exists,up 36c41f06-4183-4e12-b0e1-6d6b2a7846af
 
Not at all.

My last clue was that my setup was undersized.
There was too much SSD I/O, generating more iowait and creating a snowball effect...
The main difference was that the HDDs saw less I/O due to the nature of that storage (it only holds less frequently accessed data).
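One way to see that kind of pressure per device while the workload is running (iostat comes with the sysstat package):

Code:
iostat -x 2   # per-disk utilisation and latency: watch %util and the await columns
vmstat 2      # the "wa" column shows overall CPU time spent waiting on I/O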

I shut down my Kubernetes cluster nodes and the pods running on them, and the I/O went away, but now I have no use for Ceph... So I wonder whether I would need more powerful hardware (not more network, since it is not saturated) or more nodes?

By the way, the only answer I ever get from others is "your setup is not powerful enough", without any proof or explanation.
 
