What is the typical commit latency I should expect from HDDs in a cluster?
I'm currently in the process of migrating from my ZFS pools to Ceph pools. For the initial migration I have a random mix of hardware. Once the migration is complete, I'll shut down the ZFS pools and migrate that hardware over to the Ceph pools. I have both SSD and HDD OSDs/replication rules/pools, with some pools running 3 copies and some running 2 copies, depending on how critical the data is and my uptime requirements.
My SSD OSDs/pools seem reasonably fast and I have no major issues, but my HDD OSDs are all over the place. Right now I see 6 OSDs running at 55-85ms, but 2 running in the low 200s with spikes up to 300ms. And 1 OSD is running at <10ms (and that's a USB HDD setup, and the only OSD on a 1G connection!). One of the 2 slow disks I can explain, because it is currently pulling double duty for ZFS and Ceph. The other disk, though, is a dedicated WD Red CMR 4TB. The hours on the disk are pretty high, but SMART doesn't show any issues and it consistently passes short/extended tests.
Is the 55-85ms range what I should expect from HDDs behaving normally in my cluster, or should I expect the <10ms that I'm seeing with my temporary USB HDD? Could 200+ ms latency be an indication of a failing drive?
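For reference, this is roughly how I'm pulling the numbers: a small sketch that filters the per-OSD commit latencies reported by `ceph osd perf -f json` against a cutoff. The JSON layout below (the `osdstats`/`osd_perf_infos` nesting) matches what I see on my release but may differ on yours, and the 150ms threshold is just an arbitrary number I picked for "suspiciously slow", not anything official.

```python
import json

# Illustrative data resembling `ceph osd perf -f json` output; the exact
# nesting varies by Ceph release, so treat this layout as an assumption.
sample = '''
{"osdstats": {"osd_perf_infos": [
  {"id": 0, "perf_stats": {"commit_latency_ms": 62,  "apply_latency_ms": 62}},
  {"id": 1, "perf_stats": {"commit_latency_ms": 214, "apply_latency_ms": 214}},
  {"id": 2, "perf_stats": {"commit_latency_ms": 8,   "apply_latency_ms": 8}}
]}}
'''

THRESHOLD_MS = 150  # arbitrary cutoff for "suspiciously slow" HDD OSDs

def slow_osds(perf_json: str, threshold_ms: int = THRESHOLD_MS):
    """Return ids of OSDs whose commit latency exceeds the threshold."""
    infos = json.loads(perf_json)["osdstats"]["osd_perf_infos"]
    return [o["id"] for o in infos
            if o["perf_stats"]["commit_latency_ms"] > threshold_ms]

print(slow_osds(sample))  # → [1]
```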
Thanks!