Typical Ceph OSD Commit Latency for HDDs?

bqq100
Member
Jun 8, 2021
What is the typical commit latency I should expect from HDDs in a cluster?

I'm currently in the process of migrating from my ZFS pools to Ceph pools. For the initial migration I have a random mix of hardware; once the migration is complete, I'll shut down the ZFS pools and migrate that hardware over to the Ceph pools. I have both SSD and HDD OSDs/replication rules/pools, with some pools running 3 copies and some running 2, depending on how critical the data is and my uptime requirements.
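For reference, the split by device class looks roughly like this (a sketch only; the rule and pool names here are placeholders, not my actual names):

# Device-class based replication rules
ceph osd crush rule create-replicated rule-ssd default host ssd
ceph osd crush rule create-replicated rule-hdd default host hdd

# Assign rules and replica counts per pool (example values)
ceph osd pool set pool-critical crush_rule rule-ssd
ceph osd pool set pool-critical size 3
ceph osd pool set pool-bulk crush_rule rule-hdd
ceph osd pool set pool-bulk size 2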

My SSD OSDs/pools seem reasonably fast and I have no major issues, but my HDD OSDs are all over the place. Right now I see 6 OSDs running 55-85 ms, but 2 running in the low 200s with spikes up to 300 ms, and 1 OSD running at <10 ms (and that one is a USB HDD setup and the only OSD on a 1G connection!). One of the 2 slow disks I can explain, because it is currently doing double duty for ZFS and Ceph. The other is a dedicated WD Red CMR 4TB disk; the hours on it are pretty high, but SMART doesn't show any issues and it consistently passes short/extended tests.

Is the 55-85 ms what I should expect from HDDs behaving normally in my cluster, or should I expect the <10 ms that I'm seeing with my temporary USB HDD? Could 200+ ms latency be an indication of a drive failure?
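For anyone comparing numbers, this is roughly how I'm checking (the OSD IDs and device name below are just examples from my setup):

# Per-OSD commit/apply latency as reported by the cluster
ceph osd perf

# Map a slow OSD ID back to its host and device
ceph osd tree
ceph osd df tree

# SMART data for the suspect WD Red (device name is an example)
smartctl -a /dev/sdX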

Thanks!
 
HDD OSDs should always have their RocksDB on faster storage (SSD) if you want to use them for anything but cold storage.
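On Proxmox that would look something like this when creating the OSD (a sketch only; the device names and DB size are placeholders):

# Create an HDD OSD with its RocksDB/WAL on a faster device
# /dev/sdb = data HDD, /dev/nvme0n1 = shared SSD/NVMe, 64 GiB DB is an example
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 64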
They will be used primarily for cold storage and some large continuous writes, so I don't need them to be super speedy. I have the SSD-based pool for anything that needs to be particularly fast.

However, currently with 9 HDD OSDs across 4 nodes (3/3/2/1) I am getting a max of ~55 MB/s on a large file transfer and ~100 MB/s on a rados bench test with 16 threads. I know some of this is due to my non-ideal setup during the transition, but I want a better idea of what I should expect from the final setup.
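The benchmark was along these lines (the pool name is a placeholder for my HDD pool):

# 60-second write benchmark with 16 concurrent operations, keep objects for the read test
rados bench -p pool-bulk 60 write -t 16 --no-cleanup
# Sequential read of the same objects, then remove the benchmark objects
rados bench -p pool-bulk 60 seq -t 16
rados -p pool-bulk cleanup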
 
In such a small setup you basically get the performance of a single HDD as the maximum performance of the cluster; I would not expect more than approx. 120 MB/s here.
I have seen single HDD OSDs max out at 15 MB/s write speed because of the RocksDB load on the read/write head. RocksDB generates so much random write IO that an HDD physically cannot keep up.
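You can see this per OSD with the built-in OSD bench (the OSD ID below is an example); a pure HDD OSD will often land well below the drive's raw sequential speed:

# Write 1 GiB in 4 MiB blocks directly to osd.3 (ID is an example)
ceph tell osd.3 bench 1073741824 4194304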
 
So even with large sequential writes, RocksDB generates a bunch of random write IO? What has the biggest bang for the buck, more nodes or more OSDs?

So it sounds like, even though I plan on using this mostly for cold storage/sequential writes, I need to test out reserving some of my SSD storage for WAL/DB.
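If I go that route, my understanding is the straightforward option on Proxmox is to rebuild the HDD OSDs one at a time with a DB device, letting the cluster recover between each one (a sketch only; OSD ID, device names, and DB size are placeholders):

# Take one OSD out and wait for recovery, then rebuild it with a DB device
ceph osd out 3
systemctl stop ceph-osd@3
pveceph osd destroy 3 --cleanup
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1 --db_size 64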
 
