Hello!
I've been struggling with one of my servers for several months now.
There seems to be a hardware conflict related to SATA. My ZFS pool (4x 10 TB HDDs in 2 mirrors) reported 1 disk failure. I replaced the disk, and shortly after, 2 more disks seemed to fail (one in each of the mirrors). I assumed that this was not a disk problem and started to search for other issues.
Over the following weeks, I replaced:
1) All SATA cables
2) The PSU*
3) The mainboard, including the chassis
4) The CPU
*I should mention that since changing the PSU did not help, I reinstalled the original PSU.
So now, I have pretty much a new server. Nothing seems to help. Here are further things I tried:
1) I ran 4 hours of RAM test. No errors found.
2) Booted from a live-USB: Linux Mint. Same issues found.
3) Upgraded Proxmox to 8.2.4.
4) Systematically tried all Linux kernels that are available on the server.
5) Ran smartmontools on all HDDs. Some of the short tests, and all of the long tests, got "Aborted by the host".
6) Attached all HDDs to a workstation. Conducted smartmontools long tests (somewhere between 8 and 14 hours each). No errors found, all healthy.
7) Ran many, many scrubs on the ZFS on the server. In the beginning, inconsistent data was found and dealt with. By now, all data is deleted.
Even though it seemed unlikely, I then figured that several of the HDDs might actually be broken, so I bought 2 more. Including an older HDD I had lying around, I now have 7x 10 TB. I attached them one at a time to the server and generated several hours of load with the tool "fio". 4 disks showed no errors of any kind. I used them to create a ZFS RAIDZ pool and started restoring from backups. 2 days later, while still restoring, I got the zpool error "disk unavailable". I removed the troublemaker, created another type of RAID with the remaining 3 HDDs, and started restoring VMs again. Shortly after, I got an error about slow SATA response. The restore process got cancelled, but the zpool seems healthy.
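Since the errors move between disks, one way to narrow this down might be to count ATA trouble messages per SATA port in the kernel log and see whether they cluster on particular ports. Below is a minimal Python sketch of that idea. The function name `count_ata_trouble`, the list of "trouble" substrings, and the sample log lines are all assumptions made up for illustration; none of them come from the server described above.

```python
import re
from collections import Counter

# Matches kernel messages such as "ata3: hard resetting link" or
# "ata3.00: failed command: ..."; group 1 is the port number.
ATA_LINE = re.compile(r"\bata(\d+)(?:\.\d+)?: (.*)")

# Substrings treated as signs of link trouble (an assumed, non-exhaustive list).
TROUBLE = (
    "hard resetting link",
    "failed command",
    "exception Emask",
    "link is slow to respond",
)


def count_ata_trouble(log_text):
    """Return a Counter mapping 'ataN' to the number of suspicious log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if match and any(marker in match.group(2) for marker in TROUBLE):
            counts["ata" + match.group(1)] += 1
    return counts


# Made-up sample lines for illustration only (not taken from the real server).
SAMPLE = """\
[ 100.1] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 100.2] ata3.00: failed command: FLUSH CACHE EXT
[ 100.3] ata3: hard resetting link
[ 105.9] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 230.4] ata5: link is slow to respond, please be patient (ready=0)
"""

if __name__ == "__main__":
    # Print ports with the most suspicious lines first.
    for port, n in count_ata_trouble(SAMPLE).most_common():
        print(port, n)
```

On a real system the input text could be the saved output of `dmesg` or `journalctl -k`. If the counts follow a port rather than a disk, that would point towards the controller, backplane, or power to that bay rather than the drives themselves.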
This is just a quick summary of what has happened. Thanks for reading, and even more thanks if you can suggest a solution.