ZFS degraded - too many errors - SMART OK

geronimobb

Apr 22, 2017
Hi,

One of my pools (not a boot pool) is in a degraded state because one of the drives has 'too many errors': 39 cksum errors are reported, with 0 write and 0 read errors.
I checked the drive with SMART, and the values show nothing wrong.
I ran a zpool clear; the pool is now resilvering and already shows 155 cksum errors.
Any ideas? Thanks in advance for any tips/help.
Regards
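
For anyone who wants to keep an eye on these counters over time, here is a minimal Python sketch that pulls the READ/WRITE/CKSUM columns out of `zpool status`-style text. The pool name `tank`, the device names and the sample text itself are made up for illustration (only the 39 cksum count comes from the post above), and the parser assumes the counters are printed as plain integers.

```python
# Hypothetical sample, shaped like `zpool status` output; names are invented.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     DEGRADED     0     0    39  too many errors
            sdc     ONLINE       0     0     0

errors: No known data errors
"""


def parse_error_counters(status_text):
    """Return {name: (read, write, cksum)} for every row of the config table."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            # A blank line ends the config table.
            in_table = False
            continue
        # Only accept rows whose three counters are plain integers.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counters[fields[0]] = tuple(int(f) for f in fields[2:5])
    return counters


def devices_with_errors(status_text):
    """Keep only the rows where at least one counter is non-zero."""
    return {
        name: counts
        for name, counts in parse_error_counters(status_text).items()
        if any(counts)
    }


print(devices_with_errors(SAMPLE))
```

Comparing the result before and after a cable or port swap shows whether the errors follow the disk or stay with the port.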
 
Hi


I guess the memory is broken.

It could be RAM, as @wolfgang says. But you can also investigate other things:

1. SATA cables
- stop the server
- replace the SATA cable for the HDD with ZFS errors (use SATA cables with metal clips)
- start the server, and hope ;)

2. SATA port
- do the same steps as in 1, but use another SATA port on your motherboard/controller
- hope again ;)

3. Replace the PSU
- hope again ;)

Good luck !
 
Thanks all for the replies.

After I ran 'zpool clear', a scrub was started. After it finished, the errors were still there and the pool was still in a degraded state. The scrub had repaired 11G. Since I had already rebooted several times before the 'zpool clear' command, and had tried re-attaching the (same) cables, I didn't expect another reboot to make a difference, but to my surprise all the errors are gone and the pool is online...
The only change I made was in the BIOS of the LSI SAS controller, from 'boot os only' to 'boot bios & os', which I guess makes the drives initialize before the operating system starts. I'm not sure whether that made the difference.

However, I am now unsure whether the repairs made by the scrub could have damaged data. But since there were errors at that moment, I suppose what was corrected was right. I started a new scrub...
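
On the question of whether a scrub repair can damage data: as far as I understand it, a block is only rewritten from a copy that matches the checksum stored separately from the data, so a repair should not overwrite good data with bad. Below is a toy Python model of that idea for a mirror. It is not how ZFS is implemented: `scrub_block` is a hypothetical helper, SHA-256 is just a stand-in checksum, and raidz rebuilds from parity rather than from a second copy.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in checksum for this toy model."""
    return hashlib.sha256(block).hexdigest()


def scrub_block(copies, expected):
    """Repair every copy that fails verification, using a verified copy.

    copies   -- list of bytes objects, one per mirror side (modified in place)
    expected -- checksum recorded when the block was written
    Returns the number of copies that were rewritten.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # Nothing verifies, so nothing is rewritten: the error is reported instead.
        raise ValueError("no copy matches the stored checksum; unrecoverable")
    repaired = 0
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good
            repaired += 1
    return repaired


original = b"important data"
stored = checksum(original)
# Second mirror side was corrupted somewhere between disk and memory.
copies = [original, b"important dat\x00"]
repaired = scrub_block(copies, stored)
print(repaired, copies[1] == original)
```

In this model the only way to lose data is for every copy to fail verification, and in that case the block is flagged rather than silently replaced.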

I don't think the memory is faulty, because I did not get any warnings via IPMI (or in the logs). I also ran a memtest, but certainly not for long enough. I do remember that two months ago a disk from the same pool went offline once; a simple reboot solved it. This happened twice, after I installed an LSI SAS controller (IT mode), so maybe the PSU could be the cause? The system is built on a Supermicro A1SRM-2558F.

I suppose I cannot do much more at the moment than wait and see whether it happens again. I had already ordered a replacement drive, but I don't think anything is wrong with the disk.

Any suggestions or ideas are still welcome!
Thanks for the advice.
 
Lucky, or the calm before the storm ;-)

I think I'll replace my PSU. The cabling should be replaced too, I suppose. If it continues, I'll move the controller and pool to another system. Then I will suspect the mobo/RAM.

I saw one trace of read/write errors on all drives in the logs, just for a brief moment during the rebuild.

Or it stays OK and I'm just lucky ;-)
 
Update: cksum errors started appearing on another pool, though only 3.

I decided to power the server directly from the UPS, and rewired the power lines from the PSU to the disks inside the server. It turned out all the disks were on one line of the PSU, so I divided them equally over the available lines.
After some days of observing and testing, no more errors.
Maybe too early to declare victory, but at least a good development!
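
The rewiring amounts to a round-robin split of the disks over the PSU lines. A small Python sketch of that, with invented disk and line names (the thread does not say how many disks or lines there actually are):

```python
def spread_over_lines(disks, lines):
    """Assign disks to PSU lines round-robin so no line carries all of them."""
    assignment = {line: [] for line in lines}
    for i, disk in enumerate(disks):
        assignment[lines[i % len(lines)]].append(disk)
    return assignment


# Hypothetical layout: six disks, three power lines.
disks = ["disk1", "disk2", "disk3", "disk4", "disk5", "disk6"]
lines = ["line_a", "line_b", "line_c"]
for line, members in spread_over_lines(disks, lines).items():
    print(line, members)
```

With any such split, the number of disks per line differs by at most one, which is the "divided equally" arrangement described above.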

Thanks for the help!
 