[SOLVED] CEPH IOPS dropped by more than 50% after upgrade from Nautilus 14.2.22 to Octopus 15.2.15

This means that your HW setup runs with NPS1 (see page 10 of https://developer.amd.com/wp-content/resources/56745_0.80.pdf for details).
I'm not much of an expert in the nasty details of the AMD EPYC NUMA architecture, but I would say it is not your bottleneck (it might give you a few extra percent if optimized, though).
Thanks, I know about this document, and I don't find the NUMA architecture of EPYC (Zen 2) unpleasant; on the contrary, the many configuration options make it flexible.
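For anyone who wants to double-check the NPS setting from the OS side: the number of NUMA nodes Linux reports reflects it (NPS1 exposes a single NUMA node per socket). A quick sanity check with standard tools, not taken from this thread:
Code:
# lscpu | grep -i numa
# numactl --hardware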
As for Ceph, the current benchmarks show the following (still no tuning, I just waited a few days):
Code:
# rados bench -p bench 30 -t 256 -b 1024 write
hints = 1
Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for up to 30 seconds or 0 objects
...
Total time run:         30.0526
Total writes made:      837823
Write size:             1024
Object size:            1024
Bandwidth (MB/sec):     27.2252
Stddev Bandwidth:       7.85848
Max bandwidth (MB/sec): 47.8105
Min bandwidth (MB/sec): 17.3945
Average IOPS:           27878
Stddev IOPS:            8047.08
Max IOPS:               48958
Min IOPS:               17812
Average Latency(s):     0.00917077
Stddev Latency(s):      0.017061
Max latency(s):         0.403896
Min latency(s):         0.000575539
Cleaning up (deleting benchmark objects)
Removed 837823 objects
Clean up completed and total clean up time :22.7427

Code:
# rados bench -p bench 60 -t 256 -b 4096 write
hints = 1
Maintaining 256 concurrent writes of 4096 bytes to objects of size 4096 for up to 60 seconds or 0 objects
...
Total time run:         60.1782
Total writes made:      1754802
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     113.907
Stddev Bandwidth:       42.8305
Max bandwidth (MB/sec): 277.535
Min bandwidth (MB/sec): 57.9297
Average IOPS:           29160
Stddev IOPS:            10964.6
Max IOPS:               71049
Min IOPS:               14830
Average Latency(s):     0.00875424
Stddev Latency(s):      0.0201084
Max latency(s):         0.784958
Min latency(s):         0.000596609
Cleaning up (deleting benchmark objects)
Removed 1754802 objects
Clean up completed and total clean up time :46.5646

Code:
# rados bench -p bench 60 -t 2048 -b 4096 write
hints = 1
Maintaining 2048 concurrent writes of 4096 bytes to objects of size 4096 for up to 60 seconds or 0 objects
...
Total time run:         60.3087
Total writes made:      2227620
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     144.285
Stddev Bandwidth:       20.7965
Max bandwidth (MB/sec): 190.551
Min bandwidth (MB/sec): 84.3906
Average IOPS:           36936
Stddev IOPS:            5323.9
Max IOPS:               48781
Min IOPS:               21604
Average Latency(s):     0.0551987
Stddev Latency(s):      0.0404138
Max latency(s):         1.12137
Min latency(s):         0.00140223
Cleaning up (deleting benchmark objects)
Removed 2227620 objects
Clean up completed and total clean up time :42.1453
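A note for anyone replicating these numbers: the same pool can also be benchmarked for reads with standard rados bench options. This is only a sketch (not something run in this thread); write with --no-cleanup first so there are objects to read, then clean up manually:
Code:
# rados bench -p bench 60 -t 256 -b 4096 write --no-cleanup
# rados bench -p bench 60 -t 256 seq
# rados bench -p bench 60 -t 256 rand
# rados -p bench cleanup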
 
Ahh perfect. This means after waiting a few days, your performance came back to normal?
I don't know what the norm should be; I can only guess. But performance has doubled, which makes Ceph usable in production.