[SOLVED] WD RED degrades when used as OSDs

bly

Member
Mar 15, 2024
Hi all,
I have three brand-new WD RED 2TB disks that I added as OSDs in my Ceph cluster, and I occasionally receive "slow operations in BlueStore" health warnings from them.

I noticed the situation worsens when, for example, I mark one OSD OUT and its PGs start being transferred to the other WD RED OSD in the same machine: not only do operations slow down, but I also see response times from that disk climb to an insane 3000 ms and more while the PG transfer is running.

One machine had both of its OSDs on WD REDs and was the worst-performing one. As a test, to rule out other causes, I replaced one WD RED with a Samsung 870 EVO 2TB, and after this change only the remaining WD RED is still giving me headaches.

I suspect something is wrong with the WD RED series altogether, but if any of you have other hints, let me know. TIA!
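
In case it helps, these are the standard commands I've been using to see which OSD is slow (osd.3 here is just an example id):

Code:
ceph health detail        # shows which OSDs report slow ops
ceph osd perf             # commit/apply latency per OSD, in ms
ceph daemon osd.3 dump_historic_ops   # recent slow requests (run on the node hosting that OSD)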
 
While monitoring the disks, I noticed that the WD REDs run 5-8 °C hotter than the Samsungs: under normal operation the Samsungs sit at 45 °C and the WDs at 53 °C.

I know the max allowed temperature is 70 °C for both, but I think thermal throttling starts at about 60 °C, so that may be what's happening here.
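
I'm reading the temperatures from SMART, something like this (the device name is just an example):

Code:
smartctl -A /dev/sda    # look for Temperature_Celsius (attribute 194 on most SATA drives)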
 
I don't know if this applies here, but depending on the firmware of the drives, single I/Os can take a long time and are not predictable. This is often a problem with ZFS and could also be a problem here. Enterprise HDDs cap their error-recovery time (TLER/ERC) and therefore have a predictable response time; if a read takes longer than that, the I/O fails. Consumer drives try "harder" to read the data and therefore take longer (seconds instead of milliseconds), which creates strange application hangs. I experienced this with ZFS, and it can be monitored with iostat: look at the I/O times there. I haven't used hard disks with Ceph, so I don't know if this also applies there.
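
For example (iostat comes with the sysstat package; the await columns are the average I/O times in ms):

Code:
iostat -x 5    # extended per-device stats every 5 seconds; watch r_await/w_await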
 
If you are really talking about a WD RED 2TB, then it is likely using SMR, which makes it pretty much unusable for hypervisor workloads.
 
Try lsblk -o+MODEL to see the exact model. It's also shown in node > Disks.
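
The output looks roughly like this (the model strings here are just examples):

Code:
# lsblk -o+MODEL
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS MODEL
sda      8:0    0  1.8T  0 disk             WD Red SA500 2.5 2TB
sdb      8:16   0  1.8T  0 disk             Samsung SSD 870 EVO 2TB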
 
I couldn't find anything saying they are SMR, but I confirmed SMR-like behavior once write IOPS get large.
Will replace them.
 
Please share their model number for completeness. See the message above for how to get it.
 
Those are SSDs. There is no SMR on SSDs because there is no magnetic recording, so that can be ruled out as the cause.
The thing is that WD Red SSDs are still consumer SSDs, even if they are labelled as NAS SSDs.
They lack critical features such as PLP (power-loss protection), which can drastically increase performance and decrease wear when used with ZFS (my knowledge about Ceph is lacking, so I don't know the behaviour there).
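
You can see the PLP effect yourself with a quick sync-write test in fio; a minimal sketch (the filename is just an example, use a scratch file so nothing gets overwritten):

Code:
# 4k synchronous writes at queue depth 1: drives without PLP must flush
# every write to flash and typically drop to a few hundred IOPS here,
# while PLP drives can acknowledge from protected cache and stay far higher
fio --name=sync-write --filename=/tmp/fio-plp-test.bin --size=1G \
    --rw=write --bs=4k --direct=1 --sync=1 --iodepth=1 \
    --numjobs=1 --runtime=60 --time_based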
 
WD Red SA500 2.5 2TB and WDS200T2R0A-68CKB0 on the label.
bly said: "I couldn't find anything saying they are SMR, but I confirmed SMR-like behavior once write IOPS get large."
SMR only applies to rotating HDDs; these are TLC-flash SATA SSDs. QLC flash is terrible with sustained writes (throughput can drop to KB/s), but most TLC-flash consumer drives work "well enough" for homelabs (though I have no experience with Ceph). A sustained-write test will show which kind you have; see the sketch at the end of this post.
bly said: "Will replace them."
Try (second-hand) enterprise SSDs with Power Loss Protection (PLP), as all other SATA TLC-flash SSDs will probably behave similarly. Maybe search the forum for drive suggestions?
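
For the sustained-write check mentioned above, a rough sketch with fio (the filename is just an example; put the scratch file on the disk under test, and note that writing this much data adds wear):

Code:
# long sequential 1M writes, enough to exhaust any SLC cache;
# watch the live bandwidth: cache-limited drives start fast and
# then drop sharply once the cache is full
fio --name=sustained --filename=/tmp/fio-sustained.bin --size=50G \
    --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=8 \
    --runtime=300 --time_based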
 