bad ceph performance on SSD

kifeo

Active Member
Oct 28, 2019
Hello!

I have this Proxmox and Ceph setup:
Code:
root@proxmox1:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
 -1       18.19080 root default                             
 -7        0.90970     host proxmox1                         
  5   ssd  0.90970         osd.5         up  1.00000 1.00000
 -3        5.45695     host proxmox2                         
  0   hdd  1.81898         osd.0         up  1.00000 1.00000
  1   hdd  1.81898         osd.1         up  1.00000 1.00000
  2   hdd  1.81898         osd.2         up  1.00000 1.00000
-16        8.18535     host proxmox3                         
  3   hdd  1.81940         osd.3         up  1.00000 1.00000
 10   hdd  1.81898         osd.10        up  1.00000 1.00000
 11   hdd  1.81898         osd.11        up  1.00000 1.00000
 12   hdd  1.81898         osd.12        up  1.00000 1.00000
  9   ssd  0.90900         osd.9         up  1.00000 1.00000
-13        0.90970     host proxmox4                         
  4   ssd  0.90970         osd.4         up  1.00000 1.00000
-10        2.72910     host proxmox5                         
  6   hdd  1.81940         osd.6         up  1.00000 1.00000
  7   ssd  0.90970         osd.7         up  1.00000 1.00000


I noticed issues on my VMs that are on the ssdpool (the hddpool is for archive, non-speed-critical RBDs).
When I performed a benchmark, I was really disappointed:

Code:
root@proxmox1:~# rados bench 60 write -p ssdpool
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_proxmox1_3691
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        16         0         0         0           -           0
    2      16        23         7   13.9988        14    0.440324      1.1027
    3      16        23         7    9.3325         0           -      1.1027
    4      16        34        18   17.9984        22      2.5665     2.60108
    5      16        42        26   20.7981        32    0.246745     1.96686
    6      16        44        28   18.6649         8     4.71746     2.16382
    7      16        46        30   17.1412         8    0.223545     2.03352
    8      16        46        30   14.9984         0           -     2.03352
    9      16        46        30   13.3317         0           -     2.03352
   10      16        46        30   11.9985         0           -     2.03352
   11      16        46        30   10.9077         0           -     2.03352
   12      16        46        30   9.99878         0           -     2.03352
   13      16        46        30   9.22959         0           -     2.03352
   14      16        46        30   8.57028         0           -     2.03352
   15      16        46        30   7.99895         0           -     2.03352
   16      16        46        30   7.49901         0           -     2.03352
   17      16        46        30   7.05791         0           -     2.03352


whereas with the hddpool (spinning disks only):
Code:
root@proxmox1:~# rados bench 60 write -p hddpool
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60 seconds or 0 objects
Object prefix: benchmark_data_proxmox1_3076
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        28        12    47.999        48    0.940427    0.739306
    2      16        51        35   69.9969        92    0.722352    0.742403
    3      16        71        55   73.3288        80    0.489085    0.680408
    4      16        93        77   76.9947        88    0.325764    0.600826
    5      16       103        87   69.5933        40    0.277885    0.565222
    6      16       104        88    58.661         4    0.180992    0.560855
    7      16       104        88   50.2806         0           -    0.560855
    8      16       129       113   56.4942        50    0.821083     1.06378
    9      16       153       137   60.8822        96     0.45285     0.99641
   10      16       171       155   61.9932        72    0.420413    0.936436
   11      16       188       172   62.5386        68    0.321447    0.876192
   12      16       202       186   61.9933        56     3.64751    0.889077
   13      16       225       209   64.3008        92     3.92119    0.937779
   14      16       247       231   65.9925        88    0.400445    0.936914
   15      16       273       257   68.5256       104    0.405007    0.913546
   16      16       296       280   69.9923        92    0.477598    0.889124
   17      16       318       302   71.0511        88     1.10232    0.881567
   18      16       336       320   71.1034        72    0.520223    0.862996


Could you help me find the issue and resolve it?

Thanks
 
Please give us a little more insight into how you set up the cluster and what hardware you used.
 
proxmox1 is a NUC8i3, proxmox2 and proxmox3 are N54L microservers, and proxmox4 and proxmox5 are Dell 8200 SFF i7 machines.

The SSDs are all 1 TB.
The PCs are all connected to the same Gigabit switch and are up to date.

What other info would you like?
 
Just to say it upfront: there is not much to be expected from that hardware.

The PCs are all connected to the same Gigabit switch and are up to date.
The hddpool already uses most of the bandwidth and there is not much room for faster SSDs.
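As a rough estimate of what that means (assuming a single shared 1 GbE network for public and cluster traffic, and the usual size=3 replication):

Code:
1 GbE link            ~ 125 MB/s raw, roughly 110 MB/s of usable payload per host
hddpool rados bench   ~ 60-100 MB/s bursts from a single client node already
size=3 replication    -> each client write is additionally forwarded to the replica
                         OSDs over the same shared links
=> the per-host Gigabit ports are close to their limit before faster disks could help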

The SSDs are all 1 TB.
Depending on the model, the SSDs might just perform like the spinners.
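One quick way to check whether the SSDs themselves are the limit is a single-threaded 4k sync-write test directly on the device, which is roughly the write pattern of an OSD journal/WAL. A sketch with fio (the device path is a placeholder, and the test overwrites data, so only point it at an empty/spare disk):

Code:
# 4k sync writes at queue depth 1 - similar to Ceph journal/WAL writes
# WARNING: this writes to the raw device and destroys whatever is on it
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

Consumer SSDs without power-loss protection often manage only a few hundred IOPS in this test, which is why they can end up no faster than spinners under Ceph.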
 
I don't understand why the HDDs perform so much faster than the SSDs; the hddpool is barely used.
I would expect roughly the same speed from both types of disk.
 
SanDisk SDSSDP06
SanDisk SDSSDP12
Samsung SSD 850
and two NVMe drives on PCIe ports
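For reference, this kind of check lists the drive models and whether the kernel sees them as rotational (smartctl comes from the smartmontools package; /dev/sdX is a placeholder):

Code:
lsblk -d -o NAME,MODEL,SIZE,ROTA   # ROTA=0 means the kernel treats the disk as an SSD
smartctl -i /dev/sdX               # model, firmware and interface details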
 
I don't understand why the HDDs perform so much faster than the SSDs; the hddpool is barely used.
Check your network saturation. Replication and client traffic share the 1 GbE link (if the public and cluster networks are not separated).
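To rule the network in or out, it can also help to measure the raw throughput between two nodes while the cluster is otherwise idle, e.g. with iperf3 (needs to be installed on both ends; <node-ip> is a placeholder):

Code:
# on one node
iperf3 -s
# on another node
iperf3 -c <node-ip> -t 30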
 
Thanks for the reply. However, if the network were the cause, shouldn't it affect both pools during the benchmark?
 
Thanks for the reply. However, if the network were the cause, shouldn't it affect both pools during the benchmark?
I meant before as well as during the benchmark. Network congestion can lead to those drops.

And do the pools use different CRUSH rules?
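Which rule each pool actually uses can be checked with, for example:

Code:
ceph osd pool get ssdpool crush_rule
ceph osd pool get hddpool crush_rule
ceph osd crush rule dump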
 
I tried it multiple times and on different days, and I always get the same results.

Yes, the CRUSH rules are different; the only difference is the hdd/ssd device class:

Code:
rule replicated_hdd {
    id 1
    type replicated
    min_size 2
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 2
    type replicated
    min_size 2
    max_size 10
    step take default class ssd
    step chooseleaf firstn 0 type host
    step emit
}
 
Yes, the CRUSH rules are different; the only difference is the hdd/ssd device class.
Can you please post a ceph osd dump? A ceph df detail and ceph osd df tree would be nice as well.

I tried it multiple times and on different days, and I always get the same results.
What is the saturation of your network before and during the benchmark?
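For example, leaving a per-interface counter running on the benchmarking node shows this directly (sar is part of the sysstat package; the interface name is just an example):

Code:
sar -n DEV 1          # per-second RX/TX throughput for every interface
ip -s link show eno1  # raw per-interface counters if sysstat is not installed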
 
The network is not saturated: the first spike is the hddpool benchmark (around 100M), the second is the ssdpool benchmark (around 50M); the maximum seen on the interface is 400Mb, during VM migrations.
[screenshot: network throughput graph showing the two benchmark spikes]


ceph osd dump
Code:
osd.0 up   in  weight 1 up_from 31272 up_thru 32880 down_at 31271 last_clean_interval [30448,31269) [v2:192.168.1.10:6816/1037466,v1:192.168.1.10:6817/1037466] [v2:192.168.1.10:6818/1037466,v1:192.168.1.10:6819/1037466] exists,up 5fc96b0a-3c3b-405d-aed1-15a445f8400d
osd.1 up   in  weight 1 up_from 31268 up_thru 32880 down_at 31267 last_clean_interval [30450,31266) [v2:192.168.1.10:6800/1036685,v1:192.168.1.10:6801/1036685] [v2:192.168.1.10:6803/1036685,v1:192.168.1.10:6805/1036685] exists,up 843192e0-6665-4d81-b655-aefaafbf2cd3
osd.2 up   in  weight 1 up_from 31265 up_thru 32881 down_at 31264 last_clean_interval [30450,31263) [v2:192.168.1.10:6802/1035783,v1:192.168.1.10:6804/1035783] [v2:192.168.1.10:6806/1035783,v1:192.168.1.10:6808/1035783] exists,up d6e9f6c8-6347-427a-97c6-69b1dcdfaec0
osd.3 up   in  weight 1 up_from 31262 up_thru 32881 down_at 31261 last_clean_interval [30454,31260) [v2:192.168.1.9:6816/3884208,v1:192.168.1.9:6817/3884208] [v2:192.168.1.9:6818/3884208,v1:192.168.1.9:6819/3884208] exists,up 76b29428-d98e-47ce-ad72-764fc15d17e4
osd.4 up   in  weight 1 up_from 32867 up_thru 32881 down_at 32861 last_clean_interval [32685,32865) [v2:192.168.1.5:6800/2415482,v1:192.168.1.5:6801/2415482] [v2:192.168.1.5:6808/5415482,v1:192.168.1.5:6809/5415482] exists,up 050bb534-28c0-41fe-9240-7d77fa87aa65
osd.5 up   in  weight 1 up_from 32881 up_thru 32881 down_at 32880 last_clean_interval [31755,32880) [v2:192.168.1.199:6802/968,v1:192.168.1.199:6803/968] [v2:192.168.1.199:6810/17000968,v1:192.168.1.199:6811/17000968] exists,up a2828e4b-7679-4977-a4eb-bdebc66b324f
osd.6 up   in  weight 1 up_from 32347 up_thru 32347 down_at 32345 last_clean_interval [31244,32346) [v2:192.168.1.7:6807/2773093,v1:192.168.1.7:6810/2773093] [v2:192.168.1.7:6811/25773093,v1:192.168.1.7:6813/25773093] exists,up 8b6da74b-bcaa-4a99-929c-351f8f9202ec
osd.7 up   in  weight 1 up_from 32654 up_thru 32881 down_at 32652 last_clean_interval [31241,32653) [v2:192.168.1.7:6800/2772635,v1:192.168.1.7:6801/2772635] [v2:192.168.1.7:6802/61772635,v1:192.168.1.7:6803/61772635] exists,up 1f2ab5e6-3ef1-45f8-8c3f-246255a97fc3
osd.9 up   in  weight 1 up_from 31250 up_thru 32881 down_at 31249 last_clean_interval [30447,31248) [v2:192.168.1.9:6824/3880644,v1:192.168.1.9:6825/3880644] [v2:192.168.1.9:6826/3880644,v1:192.168.1.9:6828/3880644] exists,up 19f81414-7b43-4ad3-81f2-ac691101c9cb
osd.10 up   in  weight 1 up_from 31259 up_thru 32881 down_at 31258 last_clean_interval [30450,31257) [v2:192.168.1.9:6800/3883569,v1:192.168.1.9:6801/3883569] [v2:192.168.1.9:6802/3883569,v1:192.168.1.9:6803/3883569] exists,up 55a0f60c-ae89-45e9-9c6e-857d240b98c0
osd.11 up   in  weight 1 up_from 31256 up_thru 32881 down_at 31255 last_clean_interval [30447,31251) [v2:192.168.1.9:6808/3881805,v1:192.168.1.9:6809/3881805] [v2:192.168.1.9:6810/3881805,v1:192.168.1.9:6811/3881805] exists,up 96a6d843-51df-430e-ab0d-fa1e487ad069
osd.12 up   in  weight 1 up_from 31250 up_thru 32347 down_at 31249 last_clean_interval [30450,31248) [v2:192.168.1.9:6827/3880421,v1:192.168.1.9:6829/3880421] [v2:192.168.1.9:6831/3880421,v1:192.168.1.9:6833/3880421] exists,up 36c41f06-4183-4e12-b0e1-6d6b2a7846af
 
Not at all.

My last clue was that my setup was undersized.
There was too much SSD I/O, generating more iowait and creating a snowball effect...
The main difference was that the HDDs saw less I/O due to the nature of that storage (it only holds less frequently accessed data).
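One way to see that kind of pressure per device while the workload is running (iostat comes with the sysstat package):

Code:
iostat -x 2   # per-disk utilisation and latency: watch %util and the await columns
vmstat 2      # the "wa" column shows overall CPU time spent waiting on I/O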

I shut down my Kubernetes cluster nodes and the pods running on them, and the I/O went away, but now I have no use for Ceph... So I wonder whether I would need more powerful hardware (not more network, since it is not saturated) or more nodes?

By the way, the only answer I ever get from others is "your setup is not powerful enough", without any proof or explanation.
 
