I also encountered this problem. In my case, if I restart the affected OSD after the warning appears, the alarm disappears after a few days, so if this happens on an OSD again in the future, restarting that OSD temporarily solves the problem.

Just to chime in, I am also seeing this on my homelab. Everything was fine until I upgraded to 19 (from 18). And I think there is a real issue, not just alerting: when it gets bad enough, the MDS servers start getting "cranky" (slow ops) and won't become healthy again until I restart the OSDs that are running slow.
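For anyone looking for the exact commands, restarting a single OSD is done per daemon; the right invocation depends on how the cluster was deployed, and osd.12 below is only a placeholder id:

# see which OSDs are named in the warning
ceph health detail

# classic package/systemd deployment
systemctl restart ceph-osd@12

# cephadm / containerized deployment
ceph orch daemon restart osd.12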
ohmmm... I know you don't want to hear it, but this is an SSD, and before 18.2.6 it worked without issue.

Looks like you may have a bad drive. These ops complete in 120 ms, which is long even for spinning hard drives; with a spinning drive you would expect <20 ms plus network latency of ~1-2 ms. Are you using SMR drives? These seem to be mostly around the time you are rebuilding an OSD, which can indeed put very high load on both drives and network.
Again, either you are severely overloading your network, leading to packet drops at this time, or your drive is failing and causing some operations to take very long. You won't notice in most cases because Ceph will redirect operations that don't complete in time, but rebuilding does require the drive to be functional.
Given this is generally around commit time to disk, I would suspect the disk.
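If it helps, the per-OSD latency counters and the slow-op dumps usually show where the time is going; osd.12 below is just an example id, and the daemon commands have to be run on the host carrying that OSD:

# commit/apply latency per OSD, in milliseconds
ceph osd perf

# ops currently stuck on a suspect OSD
ceph daemon osd.12 dump_ops_in_flight

# recently completed slow ops with per-stage timings (look for the commit step)
ceph daemon osd.12 dump_historic_ops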
Hello, what brand and model? 120 ms is a really long time; for an SSD you would expect <2 ms. I don't think it was ever 'without issue', you just never noticed, or the new versions have a slightly different load pattern that triggers it, or you added more load. You can upgrade to Squid and see if it improves anything, but the logs are pretty clear.
Thank you, we have planned to update the firmware on most of the disks from M3CR043 to M3CR046.

This is a common issue with the (nearly decade-old) MX500s. They are not very good in general, even for desktop use; they have major firmware issues and glitch out even in desktops. You can see if updating the firmware resolves the issue, but also check your SMART values (smartctl -a /dev/xxx). I will guess you have tons of pending sectors, and they are probably 'worn out' to some extent as well. One of the recommendations is to start the computer but not start using the drives for about a minute so the firmware can boot up properly.
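As a concrete example of what to look at (the device name /dev/sda is just a placeholder), the relevant attributes are the pending/reallocated sector counts and the wear indicator:

# firmware version plus the full SMART attribute table
smartctl -a /dev/sda

# only the attributes that usually matter for this failure mode
smartctl -A /dev/sda | grep -iE 'pending|reallocat|uncorrect|wear'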
The issue was with a Crucial CT240BX500SSD1.