Typical Ceph OSD Commit Latency for HDDs?

bqq100
Member
Jun 8, 2021
What is the typical commit latency I should expect from HDDs in a cluster?

I'm currently in the process of migrating from my ZFS pools to Ceph pools. For the initial migration I have a random mix of hardware; once the migration is complete, I'll shut down the ZFS pools and migrate that hardware over to the Ceph pools. I have both SSD and HDD OSDs/replication rules/pools, with some pools running 3 copies and some running 2, depending on how critical the data is and my uptime requirements.
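For reference, the split by device class looks roughly like this (a sketch only; the rule and pool names here are placeholders, not my actual names):

# Device-class based replication rules
ceph osd crush rule create-replicated rule-ssd default host ssd
ceph osd crush rule create-replicated rule-hdd default host hdd

# Assign rules and replica counts per pool (example values)
ceph osd pool set pool-critical crush_rule rule-ssd
ceph osd pool set pool-critical size 3
ceph osd pool set pool-bulk crush_rule rule-hdd
ceph osd pool set pool-bulk size 2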

My SSD OSDs/pools seem reasonably fast and I have no major issues, but my HDD OSDs are all over the place. Right now I see 6 OSDs running 55-85 ms, but 2 running in the low 200s with spikes up to 300 ms, and 1 OSD running at <10 ms (and that one is a USB HDD setup and the only OSD on a 1G connection!). One of the 2 slow disks I can explain, because it is currently doing double duty for ZFS and Ceph. The other is a dedicated WD Red CMR 4TB disk; the hours on it are pretty high, but SMART doesn't show any issues and it consistently passes short/extended tests.

Is the 55-85 ms what I should expect from HDDs behaving normally in my cluster, or should I expect the <10 ms that I'm seeing with my temporary USB HDD? Could 200+ ms latency be an indication of a drive failure?
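For anyone comparing numbers, this is roughly how I'm checking (the OSD IDs and device name below are just examples from my setup):

# Per-OSD commit/apply latency as reported by the cluster
ceph osd perf

# Map a slow OSD ID back to its host and device
ceph osd tree
ceph osd df tree

# SMART data for the suspect WD Red (device name is an example)
smartctl -a /dev/sdX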

Thanks!
 
HDD OSDs should always have their RocksDB on faster storage (SSD) if you want to use them for anything but cold storage.
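On Proxmox that would look something like this when creating the OSD (a sketch only; the device names and DB size are placeholders):

# Create an HDD OSD with its RocksDB/WAL on a faster device
# /dev/sdb = data HDD, /dev/nvme0n1 = shared SSD/NVMe, 64 GiB DB is an example
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 64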
They will be used primarily for cold storage and some large continuous writes, so I don't need them to be super speedy. I have the SSD-based pool for anything that needs to be particularly fast.

However, currently with 9 HDD OSDs across 4 nodes (3/3/2/1) I am getting a max of ~55 MB/s on a large file transfer and ~100 MB/s on a rados bench test with 16 threads. I know some of this is due to my non-ideal setup during the transition, but I want a better idea of what I should expect from the final setup.
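The benchmark was along these lines (the pool name is a placeholder for my HDD pool):

# 60-second write benchmark with 16 concurrent operations, keep objects for the read test
rados bench -p pool-bulk 60 write -t 16 --no-cleanup
# Sequential read of the same objects, then remove the benchmark objects
rados bench -p pool-bulk 60 seq -t 16
rados -p pool-bulk cleanup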
 
In such a small setup you basically get the performance of a single HDD as the maximum performance of the cluster; I would not expect more than approx. 120 MB/s here.
I have seen single HDD OSDs max out at 15 MB/s write speed because of the RocksDB load on the read/write head. RocksDB generates so much random write IO that an HDD physically cannot keep up.
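You can see this per OSD with the built-in OSD bench (the OSD ID below is an example); a pure HDD OSD will often land well below the drive's raw sequential speed:

# Write 1 GiB in 4 MiB blocks directly to osd.3 (ID is an example)
ceph tell osd.3 bench 1073741824 4194304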
 
So even with large sequential writes, RocksDB generates a bunch of random write IO? What has the biggest bang for the buck, more nodes or more OSDs?

So it sounds like, even though I plan on using this mostly for cold storage/sequential writes, I need to test out reserving some of my SSD storage for WAL/DB.
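If I go that route, my understanding is the straightforward option on Proxmox is to rebuild the HDD OSDs one at a time with a DB device, letting the cluster recover between each one (a sketch only; OSD ID, device names, and DB size are placeholders):

# Take one OSD out and wait for recovery, then rebuild it with a DB device
ceph osd out 3
systemctl stop ceph-osd@3
pveceph osd destroy 3 --cleanup
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 64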
 
