ceph osd poor performance, one per node

MACscr

I have 3 Ceph storage nodes with only 3 SSDs each for storage. They are only on SATA II links, so they max out at about 141 MB/s. I am fine with that, but I have one OSD on each node with absolutely awful performance and I have no idea why. It seems to be osd.0, osd.3, and osd.4 that are just awful. Each OSD does have a 5 GB partition, as you can see in the fdisk output. Any suggestions are appreciated.

Here is a bit of info about the setup:

Code:
root@stor1:~# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                    28                   29
  1                     5                    6
  2                     6                    7
  3                    82                   83
  4                    72                   73
  5                    34                   35
  6                     9                    9
  7                     1                    2
  8                     5                    6

root@stor1:~# ceph tell osd.* bench -f plain
osd.0: bench: wrote 1024 MB in blocks of 4096 kB in 105.317900 sec at 9956 kB/sec
osd.1: bench: wrote 1024 MB in blocks of 4096 kB in 8.698469 sec at 117 MB/sec
osd.2: bench: wrote 1024 MB in blocks of 4096 kB in 7.977820 sec at 128 MB/sec
osd.3: bench: wrote 1024 MB in blocks of 4096 kB in 139.787650 sec at 7501 kB/sec
osd.4: bench: wrote 1024 MB in blocks of 4096 kB in 69.808043 sec at 15020 kB/sec
osd.5: bench: wrote 1024 MB in blocks of 4096 kB in 8.102800 sec at 126 MB/sec
osd.6: bench: wrote 1024 MB in blocks of 4096 kB in 7.765746 sec at 131 MB/sec
osd.7: bench: wrote 1024 MB in blocks of 4096 kB in 6.968641 sec at 146 MB/sec
osd.8: bench: wrote 1024 MB in blocks of 4096 kB in 6.914912 sec at 148 MB/sec

root@stor1:~# fdisk -l /dev/sda

Disk /dev/sda: 477 GiB, 512110190592 bytes, 1000215216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5970E0FA-79F4-4A71-A9BC-62884B711CB4

Device        Start        End   Sectors  Size Type
/dev/sda1  10487808 1000215182 989727375  472G unknown
/dev/sda2      2048   10485760  10483713    5G unknown

Partition table entries are not in disk order.
root@stor1:~#
root@stor1:~# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.15991 root default
-2 1.37997     host stor1
 0 0.45999         osd.0       up  1.00000          1.00000
 1 0.45999         osd.1       up  1.00000          1.00000
 6 0.45999         osd.6       up  1.00000          1.00000
-3 1.38997     host stor2
 2 0.45999         osd.2       up  1.00000          1.00000
 3 0.45999         osd.3       up  1.00000          1.00000
 7 0.46999         osd.7       up  1.00000          1.00000
-4 1.38997     host stor3
 4 0.45999         osd.4       up  1.00000          1.00000
 5 0.45999         osd.5       up  1.00000          1.00000
 8 0.46999         osd.8       up  1.00000          1.00000
root@stor1:~#
 
It really depends.

Follow this blog to test with fio. (You need to test direct sync writes; some consumer SSDs are pretty bad at this.)

https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/
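For reference, a minimal sketch of the kind of direct, synchronous 4k write test that blog describes; /dev/sdX is a placeholder, and writing to the raw device like this destroys its contents:

Code:
# WARNING: this writes to the raw device and will destroy the data on it.
# /dev/sdX is a placeholder; do NOT point it at a disk holding an active OSD.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test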

Thanks for that. Unfortunately I can't run those tests on drives that are already in "production", as they wipe the drive. That is very interesting to know, though. If all of my drives are consumer and mostly the same model, why would I be getting such drastic differences on that perf test for just those three? Are those three acting as the journal for all OSDs on each node when I run the test? I'm fine with the 141 MB/s as I mentioned, but the 7 MB/s and such is what worries me.
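One quick, non-destructive way to check where each filestore OSD's journal actually lives (a sketch, assuming the default /var/lib/ceph layout): the journal is a symlink inside each OSD's data directory, so the target shows whether each OSD journals to its own 5 GB partition or to a shared device.

Code:
# Each filestore OSD keeps its journal behind a symlink in its data directory
ls -l /var/lib/ceph/osd/ceph-*/journal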
 
Unfortunately I can't run those tests on drives that are already in "production", as they wipe the drive.
As it's Ceph, you can remove an OSD to do the test and then re-create it. ;)
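Roughly, taking one OSD out of service so its disk can be wiped and benchmarked could look like this (a sketch using osd.0 as the example; exact steps depend on your Ceph version):

Code:
# Mark the OSD out so Ceph rebalances its data onto the remaining OSDs
ceph osd out 0
# Wait until all PGs are active+clean again before touching the disk
ceph -s
# Stop the daemon; the disk can then be wiped and tested with fio
systemctl stop ceph-osd@0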

Another question that arises: are those SSDs connected to a RAID controller or an HBA? And what model are those SSDs?
 
Yes, but a recovery hurts performance with so few OSDs. Anyway, they are Crucial MX100s and MX300s. They are connected to RAID controllers (LSI SAS1068E) that can act as HBAs (that required specific firmware; it was a few years ago), so full access is given to the disks.
 
Anyway, they are Crucial MX100s and MX300s.
That may well be the difference. I suspect that OSDs 0, 3, 4, and 5 are all the same type?

RAID controllers (LSI SAS1068E) that can act as HBAs
Are they in IT mode (HBA)? If not, they should be, as the controller interferes with Ceph.
https://pve.proxmox.com/pve-docs/chapter-pveceph.html#_precondition

Further, you could change the OSDs from filestore to bluestore (if your Ceph version is recent enough); this removes the double-write penalty of filestore OSDs.
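A rough sketch of re-creating one drained and stopped OSD as bluestore, assuming Ceph Luminous or newer (so that ceph-volume and ceph osd purge are available) and /dev/sdX as a placeholder for the wiped disk; on Proxmox the pveceph tooling can do the equivalent:

Code:
# Remove the old (out, stopped, drained) OSD from the cluster
ceph osd purge 0 --yes-i-really-mean-it
# Re-create it as a bluestore OSD on the wiped disk
ceph-volume lvm create --bluestore --data /dev/sdX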
 
Ah yes, IT mode, that was the name of it and the mode they are in.

OSD 0 is now osd.9, but here is a sample from one node and it should be similar on the other two: two MX100s and one MX300.

Code:
root@stor1:/etc/pve# hdparm -I /dev/sda | grep Model
Model Number: Crucial_CT512MX100SSD1
root@stor1:/etc/pve# hdparm -I /dev/sdb | grep Model
Model Number: Crucial_CT512MX100SSD1
root@stor1:/etc/pve# hdparm -I /dev/sdc | grep Model
Model Number: Crucial_CT525MX300SSD1
root@stor1:/etc/pve# mount | grep osd
/dev/sdc1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdb1 on /var/lib/ceph/osd/ceph-1 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sda1 on /var/lib/ceph/osd/ceph-9 type xfs (rw,noatime,attr2,inode64,noquota)
 
As far as I can see, the slow disks are all MX100s? Have you checked the SMART status of the disks? Do you have any monitoring that checks for CRC errors or similar, in case the cable or backplane is the problem? Have you tried swapping the bays to see whether it's the disk or the bay?
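For example, something along these lines per disk; the interesting attributes here are the CRC error count (cabling/backplane) and the wear / reallocated sector counters (the disk itself):

Code:
# Overall health plus the attributes that usually matter in this situation
smartctl -a /dev/sda | grep -Ei 'result|crc|realloc|wear|media'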
 
