zfs degraded - too many errors - smart ok

geronimobb

Well-Known Member
Apr 22, 2017
Hi,

One of my pools (not a boot pool) is in a degraded state because one of the drives has 'too many errors': 39 cksum errors reported, 0 write, 0 read.
I checked the drive's SMART values and they show nothing wrong.
I ran a zpool clear; the pool is now resilvering and already shows 155 cksum errors.
Any ideas? Thanks in advance for any tips/help.
Grtz
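
For anyone who wants to keep an eye on these counters without reading the full status by hand, here is a minimal sketch that picks out the devices with non-zero READ/WRITE/CKSUM counters from `zpool status`-style text. The sample text, the pool name `tank` and the device names are made up for illustration and are not the poster's actual output. The regex only handles plain integer counters, so abbreviated values are not covered.

```python
import re

# Illustrative sample only: pool and device names are hypothetical.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     0
        sdb     DEGRADED     0     0    39  too many errors

errors: No known data errors
"""

# A config row: indented name, a device state, then three integer counters.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    result = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match:
            counts = tuple(int(x) for x in match.groups()[2:5])
            if any(counts):
                result[match.group(1)] = counts
    return result

print(devices_with_errors(SAMPLE_STATUS))
```

On a real system the text could come from something like `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout` instead of the sample string.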
 
Hi


I guess the memory is broken.

It could be RAM, as @wolfgang says. But you can also investigate other things:

1. SATA cables
- stop the server
- replace the SATA cable for the HDD with the ZFS errors (use SATA cables with metal clips)
- start the server, and hope ;)

2. SATA port
- do the same steps as in 1, but use another SATA port on your MB/controller
- hope again ;)

3. Replace the PSU
- hope again ;)

Good luck !
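
Before swapping parts, the drive's own interface counters may give a hint. On many SATA drives `smartctl -A` lists attribute 199 as `UDMA_CRC_Error_Count`, which counts transfer errors on the link and so tends to point at cabling or ports rather than the platters. The attribute name varies by vendor, so treat it as an assumption. Below is a minimal sketch that pulls that raw value out of `smartctl -A`-style text. The sample rows and their values are made up for illustration.

```python
# Illustrative `smartctl -A`-style rows; the values are made up for this sketch.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4711
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

# Attribute names to watch; an assumption, adjust for your drive vendor.
WATCHED = {"UDMA_CRC_Error_Count"}

def link_error_counters(attr_text, watched=WATCHED):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counters = {}
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            try:
                counters[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value is not a plain integer; skip it
    return counters

print(link_error_counters(SAMPLE_ATTRS))
```

A raw value that keeps climbing between checks would support the cable/port theory; a value stuck at zero does not rule out power or RAM problems.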
 
Thanks all for the replies.

After I did a 'zpool clear', a scrub was started. After it finished, the errors were still there and the pool was still in a degraded state. The scrub had repaired 11G. Since I had already rebooted several times before the 'zpool clear' command, and had tried re-attaching the (same) cables, I didn't expect another reboot to make a difference. But to my surprise, after rebooting all the errors are gone and the pool is online...
The only change I made was in the BIOS of the LSI SAS controller, from 'boot os only' to 'boot bios & os', which I guess makes the drives initialize before the operating system starts. So I'm not sure whether that made the difference.

However, I am now not sure whether the scrub's repairs might have damaged data. But since there were errors at that moment, I suppose what was corrected was right. I started a new scrub...

I don't think the memory is faulty, because I did not get any warnings via IPMI (or in the logs). I also ran a memtest, but certainly not for long enough. I do remember that two months ago a disk from the same pool went offline once, and a simple reboot solved it. This happened twice, after I installed an LSI SAS controller (IT mode), so maybe the PSU could be the cause? The system is built on a Supermicro A1SRM-2558F.

I suppose I cannot do much more at the moment but wait and see whether it happens again. I had already ordered a replacement drive, but I don't think anything is wrong with the disk.

Any suggestions or ideas are still welcome!
Thanks for the advice.
 
Lucky or thunder before the storm ;-)

I think I'll replace my PSU. The cabling should be replaced too, I suppose. If it continues, I'll move the controller and pool to another system. After that I would suspect the mobo/RAM.

I saw in the logs one trace of read/write errors on all drives, just for a brief moment during the rebuild.

Or it stays OK and I'm just lucky ;-)
 
Update: cksum errors started appearing on another pool as well, only 3.

I decided to power the server directly from the UPS, and rewired the power lines from the PSU to the disks inside the server. It turned out all the disks were on one line of the PSU, so I divided them equally over the available lines.
After some days of observing and testing, no more errors.
Maybe too early to declare victory, but at least a good development!

Thanks for the help!
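
For the "observe for some days" part, a rough sketch of how two snapshots of the CKSUM counters could be compared to see whether any device is still accumulating errors. The device names and numbers are hypothetical. Keep in mind that a `zpool clear` in between restarts the counters from zero.

```python
def growing_cksum(before, after):
    """Return {device: increase} for devices whose CKSUM counter went up."""
    growth = {}
    for dev, count in after.items():
        delta = count - before.get(dev, 0)
        if delta > 0:
            growth[dev] = delta
    return growth

# Hypothetical snapshots taken a few days apart (made-up devices and counts).
first_check = {"sda": 0, "sdb": 0, "sdc": 3}
later_check = {"sda": 0, "sdb": 0, "sdc": 3}

# An empty result means no counter increased between the two checks.
print(growing_cksum(first_check, later_check))
```

An empty result over several days is a reasonable sign that the rewiring helped. Any device that shows up here again is the one to look at first.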
 