SATA issues

dejhost

Member
Dec 13, 2020
Hello!
I've been struggling with one of my servers for several months now.

There seems to be a hardware conflict related to SATA. My ZFS RAIDZ (4x 10TB HDD in 2 mirrors) claimed that there was 1 disk failure. I replaced the disk and shortly after, 2 more disks seemed to fail (one in each of the mirrors). I assumed that this was not a disk problem and started to search for other causes.

[Attachment: dmesg slow sata.jpeg]

In the following weeks, I replaced:
1) All SATA cables
2) The PSU*
3) The mainboard, including the chassis
4) The CPU

*I should mention that since changing the PSU did not help, I reinstalled the original PSU.

So now, I have pretty much a new server. Nothing seems to help. Here are further things I tried:
1) I ran 4 hours of RAM test. No errors found.
2) Booted from a live-USB: Linux Mint. Same issues found.
3) I upgraded Proxmox to 8.2.4.
4) I systematically tried all Linux kernels that are available on the server.
5) Ran smartmontools on all HDDs. Some of the short tests, and all of the long tests, got "Aborted by the host".
6) Attached all HDDs to a workstation. Conducted smartmontools long tests (something between 8-14 hours). No errors found, all healthy.
7) Ran many, many scrubs on the ZFS on the server. In the beginning, inconsistent data was found and dealt with. By now, all data is deleted.


Even if unlikely, I then figured that several of the HDDs might actually be broken, so I bought 2 more. Including an older HDD I had lying around, I now have 7x 10TB. I attached them one at a time to the server and generated several hours of load with the tool "fio". 4 disks showed no errors of any kind. I used them to create the ZFS RAIDZ and started restoring from backups. 2 days later, while still restoring, I got the zpool error "disk unavailable". I removed the troublemaker, created another type of RAID with the remaining 3 HDDs, and started restoring VMs again. Shortly after, I got the error about slow SATA response. The restore process got cancelled, but the zpool seems healthy.


This is just a quick summary of what has happened. Thanks for reading, and even more thanks if you can suggest a solution.
 
Hi,

Test each of your problem HDDs with badblocks (it writes a pattern to every block and then checks that the data reads back correctly).
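
Something like the following is a minimal sketch of such a test, not a definitive recipe. It assumes the disk shows up as /dev/sdX (a placeholder, check the real name with lsblk) and that the disk holds nothing you need, because the write test overwrites everything on it. The helper name badblocks_cmd is made up for illustration, and the -b 4096 block size is an assumption for large disks like 10TB. The script only prints the command so you can review it before running it yourself as root.

```shell
#!/bin/sh
# Build the badblocks command for a destructive write test of one disk.
# WARNING: -w overwrites every block, so all data on that disk is lost.
# /dev/sdX is a placeholder; substitute the real device name.
badblocks_cmd() {
    # -b 4096: use 4 KiB blocks (assumption: keeps the block count of a
    #          10TB disk within what badblocks can handle)
    # -w: write test patterns to every block, then read back and compare
    # -s: show progress
    # -v: verbose, report each error found
    printf 'badblocks -b 4096 -wsv %s\n' "$1"
}

# Dry run: only print the command, do not execute it.
badblocks_cmd /dev/sdX
```

On a 10TB disk a full write test will probably run for several days, so it is worth testing one disk per SATA port and watching dmesg at the same time.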

Good luck / Bafta !
 
If you use the same HDD model, and maybe even the same production batch, it's not unlikely that additional drives will fail shortly after. Rebuilding RAIDs (especially with parity) puts a lot of stress on the disks.
 
